CHAPTER 1
INTRODUCTION

1.1. Problem Statement and Objectives

Many  researchers  have  concluded  that  only  a portion  of  what  a 
computer  system  does  contributes  directly  to  the  desired  computation. 

The  rest  is  overhead  required  of  the  program  to  work  around  awkward 
hardware.  This  situation  has  come  about,  in  part,  because  of  a lack  of 
understanding  of  the  behavior  of  computer  programs  and  the  relationship 
of  that  behavior  to  the  system  hardware.  Essentially,  many  areas  of 
computer  design  are  approached  in  an  ad  hoc  manner.  One  area  where  this 
is  true  is  that  of  CPU/memory  communication  (including  address  generation). 
Since considerable overhead goes into managing this communication, it is
desirable to characterize the essential properties of the address
generation process more precisely. For example, such
characterization  is  important  in  the  realm  of  Large  Scale  Integration 
(LSI),  where  tremendous  functional  capability  can  exist  within  a single 
integrated  circuit,  but  the  communication  bandwidth  between  circuits, 
e.g.  CPU  and  memory,  is  extremely  limited. 

In this dissertation the area of CPU/memory communication is
examined. In particular, methods are presented for deriving the information
content of the CPU/memory reference stream. The information content is
then used as a measure of the efficiency of the addressing overhead
incurred by a particular CPU/memory addressing architecture. Our
principal objectives are to characterize CPU memory referencing behavior
and the addressing process within the CPU that implements the memory
reference stream, and to discover computer architectures which perform
this memory referencing function much more efficiently.

1.2. Background

Although  there  is  no  research  known  to  us  in  the  area  of 
analyzing  general  purpose  computer  system  addressing  and  memory  referencing, 
there  has  been  some  work  done  in  applying  information  theoretic  concepts  to 
other  aspects  of  computer  system  evaluation.  Constantine  [1]  has  briefly 
examined  the  information  content  of  a typical  computer  word.  Hehner  [2], 
and  Foster  and  Gonter  [3]  have  looked  at  improved  opcode  encoding,  taking 
advantage  of  the  highly  sequential  aspects  of  the  opcode  stream.  Hehner 
has  shown  that  by  using  his  techniques,  program  size  can  be  reduced  by 
as  much  as  75%  in  some  cases.  Clark  [4]  has  used  frequency  based  encoding 
to  optimize  LIST  data  structures  to  minimize  access  times. 

All  these  researchers  have  reached  the  conclusion  that,  in 
most  of  today's  general  purpose  computing  systems,  significant  compaction 
of  program  storage  space  in  memory  and  significant  reductions  in  program 
execution  time  can  be  achieved  by  more  efficient  coding  and/or  improved 
system  architecture. 


1.3.  Plan  of  Development 

In  Chapter  2 an  executing  program  is  shown  to  consist  of  an 
addressing  process  and  a computation  process.  From  the  addressing  process 
concept, the addressing overhead and addressing bandwidth requirements
can  be  formally  defined.  Also  the  information  theoretic  concept  of  nth 
order  entropy  is  applied  to  analyze  the  memory  reference  stream  of  the 
computation  process. 

It  is  shown  that  theoretically  this  entropy  is  a lower  bound 
for  the  addressing  overhead  and  is  therefore  a good  basis  for  measuring 
addressing  efficiency.  Techniques  are  developed  and  a system  of  computer 
programs  presented  for  determining  the  estimated  addressing  overhead  and 
entropy  of  an  actual  program  execution.  Numerical  results  are  then  shown 
that  demonstrate  the  considerable  inefficiency  that  exists. 

Chapter  3 briefly  examines  several  techniques  for  improving 
addressing  efficiency  within  the  context  of  standard  architectures.  These 
improvements include reducing the number of address registers and the
maximum  displacement  to  more  accurately  reflect  program  requirements, 
improving  the  allocation  of  address  registers  by  the  compiler,  and 
improving  the  indexing  capabilities  of  the  architecture. 

Chapter  4 presents  a radical  approach  to  the  design  of  the 
memory  subsystem.  In  particular  the  concept  of  second  order  memory  is 
introduced.  This  memory  architecture  improves  performance  by  taking 
advantage  of  the  higher  order,  predictable  behavior  of  a typical 
referencing stream. Two examples are analyzed in detail, clearly
demonstrating that, if one is willing to pay the price of increased space and
difficulty  of  use,  addressing  overhead  can  be  reduced  by  an  order  of 
magnitude  or  more. 


Chapter 5 then summarizes the major results and conclusions of
this research and presents a few topics on which additional research
would be most useful.

CHAPTER 2

INFORMATION CONTENT OF REFERENCE STREAMS

2.1. Introduction

In  analyzing  the  information  content  of  the  memory  address 
(reference)  trace  of  a program,  an  attempt  is  made  to  isolate  that  part 
of  the  trace  inherent  to  the  computation  in  the  program  from  the  part 
devoted  to  address  generation  performed  by  the  program.  The  program  is 
accordingly  divided  into  a computation  process  and  an  addressing  process. 
The  average  information  content  per  reference  of  the  computation  process, 
H,  is  then  analyzed. 

In this chapter the variable H is defined, as well as a method
for estimating H. Also defined are the variables B, the number of bits of
address actually sent to the memory per computation reference, and A, the
number of bits input to the addressing process per computation
reference.

In  addition,  the  methods  used  by  a system  of  analysis  programs  to 
compute  values  for  the  above  variables  are  presented. 


2.2. The Addressing Process

Before  defining  the  addressing  process,  the  concept  of  R+D 
architecture  is  introduced. 

Definition 2.2.1: An R+D architecture forms a memory address by taking
the contents of a base register R and adding some displacement D:

    Memory address = (contents of R) + D, where
    $r = \lceil \log_2(\text{number of registers}) \rceil$ and
    $d = \lceil \log_2(\text{maximum displacement in locations}) \rceil$. □

Now  r+d  is  the  number  of  bits  required  in  an  instruction  encoding  to  name 
a register  and  specify  a displacement. 

The R+D architecture is important because most addressing
architectures  use  variations  of  R+D  addressing.  Table  1 shows  a few  of 
the  addressing  modes  in  common  use  and  some  common  computers  that  use 
them.  Paged  addressing,  for  example,  requires  a page  select  register 
which  is  often  loaded  implicitly  by  indirection.  Indirection  is  actually 
just  the  automatic  reloading  of  the  memory  address  register.  Consequently 
each  form  of  addressing  has  its  own  method  for  loading  into  and  displacing 
from  explicitly  or  implicitly  addressed  base  registers.  So  in  reality, 
the  other  addressing  modes  are  just  variations  of  the  first.  Thus  it  is 
reasonably  general  to  devote  our  attention  to  a strict  R+D  architecture. 

Throughout  the  rest  of  this  dissertation,  the  IBM  360  is  used 
for  modeling  purposes.  IBM  360  program  traces  were  readily  available  to 
us.  Furthermore,  the  360  architecture  does  not  have  indirection  facilities, 
and  since  it  would  have  added  additional  complexity  to  our  models  without 
being  useful  in  our  examples,  we  did  not  consider  indirection  (except 
briefly  in  the  section  on  compilers  in  Chapter  3).  Future  research  in 
this  area  may  very  well  include  the  explicit  modeling  and  analysis  of 
multiple  level  indirect  addressing.  That  extension  to  the  model  used 
here  should  be  straightforward. 
Table 1

Addressing Types

  Addressing type        Example computers
  ---------------------  -----------------
  Base + Displacement    IBM 360, PDP-11
  Paged                  HP 2100, PDP-8
  Indirection            PDP-11, HP 2100
  Auto-Index             PDP-11
  PC-Relative            PDP-11


Table 2

IBM 360 Instruction Types

  Type                                   Memory Addressing Mode
  -------------------------------------  ----------------------
  RR - Register Reference                None
  RX - Storage Reference, indexed        (B) + (X) + D
  RS - Register and Storage Reference    (B) + D
  SI - Storage and Immediate             (B) + D
  SS - Storage and Storage               (B) + D, (B) + D

(Where (B) signifies contents of a base register,
(X) signifies contents of an index register, and
D is an integer value representing displacement.)

The IBM 360 addressing architecture lends itself quite well to
this  analysis,  since  it  has  easily  delineated  index  operations  and  is 
basically  an  R+D  architecture.  Table  2 shows  the  IBM  360  instruction 
types  and  the  addressing  mode  used  by  each.  Therefore  by  analyzing  a more 
or  less  pure  R+D  architecture  like  that  of  the  360  we  feel  we  are  performing 
a reasonably  general  analysis. 

Program execution can now be divided into two interleaved
processes: the computation process, which performs actual computational
work,  and  the  addressing  process,  which  calculates  the  references  to  memory 
needed  by  itself  and  by  the  computation  process  (see  Figure  1).  These 
are  modeled  as  follows. 

Let  A be  the  total  number  of  bits  flowing  into  the  CPU  to 
perform  the  addressing  function  averaged  over  the  total  number  of 
references  made  by  the  computation  process.  The  following  algorithm 
filters  out  the  addressing  process  references  and  defines  A for  a 360 
program  trace.  (It  is  possible  to  specify  a general  algorithm  for 
filtering  out  the  addressing  process  for  any  architecture.  However, 
since  all  our  examples  are  from  the  360,  a more  specific  360  oriented 
algorithm  is  presented.  Adaptation  of  this  algorithm  for  some  other 
computer  would  not  be  difficult.) 

Filter Algorithm

- Scan the instruction trace in the reverse direction.

- When a register is used for base or index addressing, place it on the
  active base or index register list, respectively.

- When an instruction operates on an active register, delete that
  instruction. If the instruction was a load, deactivate the destination
  register and place the source memory location on the active storage
  list.

- When a store into an active storage location is encountered, delete
  the instruction, deactivate the storage location, and activate the source
  register.

- When loading or doing arithmetic from any register to an active register,
  activate the source register, delete the instruction, and if the
  instruction is a load, deactivate the destination register.

A is computed by summing all bits of deleted instructions, all the
bits of data fetched by their associated memory references, all address
register (base and index) select bits, and all displacement bits (see
Definition 2.2.1), and then dividing the total by $N_c$, which is the
total number of references made by the instructions and partial
instructions that remain after filtering (i.e., those not deleted). □
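
The reverse scan of the filter algorithm can be sketched in a few lines of Python. This is only an illustrative simplification, not the analysis program used for the results: the `Ins` record and its field names are hypothetical, base and index registers share a single active list, whole instructions are deleted (the partial deletion of address fields in surviving instructions is not modeled), and A itself is not computed.

```python
from dataclasses import dataclass

@dataclass
class Ins:
    """Hypothetical trace record; field names are assumptions of this sketch."""
    kind: str               # "load", "store", or "arith"
    dest: str = None        # register written by a load or arithmetic op
    srcs: tuple = ()        # registers read as operands
    addr_regs: tuple = ()   # base/index registers used to form an address
    mem: int = None         # memory location referenced, if any

def filter_trace(trace):
    """Return the indices of instructions assigned to the addressing process."""
    active_regs, active_mem, deleted = set(), set(), []
    for k in range(len(trace) - 1, -1, -1):        # scan in reverse
        ins = trace[k]
        if ins.kind == "store" and ins.mem in active_mem:
            # Store into an active storage location: delete it, deactivate
            # the location, activate the source register(s).
            active_mem.discard(ins.mem)
            active_regs.update(ins.srcs)
            deleted.append(k)
        elif ins.dest in active_regs:
            # Instruction operates on an active register: delete it.
            deleted.append(k)
            if ins.kind == "load":
                active_regs.discard(ins.dest)
                if ins.mem is not None:
                    active_mem.add(ins.mem)
            active_regs.update(ins.srcs)
        # Registers used for base or index addressing become active.
        active_regs.update(ins.addr_regs)
    return sorted(deleted)

trace = [
    Ins("load", dest="R1"),                                    # 0: load an address into R1
    Ins("store", srcs=("R1",), addr_regs=("R13",), mem=500),   # 1: save the pointer
    Ins("load", dest="R2", addr_regs=("R13",), mem=600),       # 2: fetch computation data
    Ins("load", dest="R1", addr_regs=("R13",), mem=500),       # 3: reload the pointer
    Ins("arith", dest="R2", addr_regs=("R1",), mem=700),       # 4: compute, R1 as base
]
print(filter_trace(trace))   # instructions 0, 1, and 3 only set up addresses
```

In this toy trace the instructions that load, save, and reload the pointer are attributed to the addressing process, while the two instructions that touch computation data survive the filter.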

In  all  but  one  case  we  assumed  three  bits  instead  of  four  for 
address  register  select,  since  for  all  but  one  of  our  traces  at  most 
seven  or  eight  registers  were  active  at  one  time  for  addressing. 
Consequently,  the  execution  of  a single  memory  reference  (RX)  instruction 
usually  incurs  18  bits  of  addressing  overhead  (3  for  index  register 
select,  3 for  base  register  select,  and  12  for  displacement).  Also  two 
bits  are  added  to  each  computation  instruction  executed,  since  there  are 
basically four addressing operations that need to be delineated:

- RX, RS, or SI memory reference,

- SS memory reference,

- Base Register Load, and

- No memory reference (RR instruction).
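
The bit counts quoted above follow directly from Definition 2.2.1. The short Python sketch below reproduces the arithmetic; the helper name `field_bits` is ours, and the figure of 4096 displacement values is the assumption corresponding to a 12-bit displacement field.

```python
import math

def field_bits(count):
    """Bits needed to name one of `count` alternatives: ceil(log2(count))."""
    return math.ceil(math.log2(count))

r = field_bits(8)          # register select with 8 active address registers
d = field_bits(4096)       # displacement field covering 4096 locations
rx_overhead = r + r + d    # index select + base select + displacement
tag = field_bits(4)        # four addressing operations to distinguish

print(r, d, rx_overhead, tag)
```

With four bits of register select instead of three, the same calculation gives 20 bits for an RX reference.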

This  algorithm  filters  out  only  active  addressing  activity. 

This  means  that  if  an  address  register  is  set  up  in  anticipation  of  a 
branch  that  is  never  taken  (or  an  operand  that  is  never  accessed),  that 
set  up  operation  is  not  filtered.  This  is  one  of  the  gray  areas  of 
algorithm  operation,  since  it  is  impossible  for  the  algorithm  to  detect 
"unfulfilled"  addressing  process  activity. 

Definition 2.2.2: For the ith word read from memory by a trace, let

    $w_i$ = the number of bits in the word,

    $a_i$ = the number of bits in that word which are associated with the
    addressing process, i.e., those bits deleted by the filter
    algorithm from the referenced word ($a_i = w_i$ in many cases), and

    $c_i = w_i - a_i$ = those bits of $w_i$ associated with the computation
    process, i.e., the number of bits in the referenced word which
    remain after the filter algorithm is run. □

The computation process then operates given instructions and
data, $c_i$, fetched from the memory. Computational work is performed on the
data.

The addressing process operates given instructions and data, $a_i$,
from the stored program in memory. The addressing process is responsible
for creating all memory references needed by both processes.


Definition 2.2.3: Let

    $$A_{N_c} = \frac{1}{N_c} \sum_i a_i ,$$

where $N_c$ is the number of computation process references. That is, given
the total number of references,

    $N_a$ = number of references with $c_i = 0$

($N_a$ = number of references for instructions rather than number of
instructions, since some instructions require multiple references), and
the number of data references in the addressing process, we get

    [equation illegible in source]

Note that $N_c$ includes all referenced words with $c_i \neq 0$. In an infinite
process, we define A as

    $$A = \lim_{N_c \to \infty} A_{N_c}.$$ □

A then  is  the  average  number  of  addressing  bits  input  into  the 
CPU  per  computation  reference.  The  addressing  process  can  be  visualized 
then as a finite, deterministic encoder which generates the entire memory
reference  stream. 

Definition 2.2.4: B is the number of bits of address which are sent to
the memory per computation reference. □

Until Chapter 4 it is assumed, as in conventional computers,
that the number of bits being sent to memory consists of fixed size
packets of w bits each so that

    $$B = \frac{w \cdot (\text{total number of references})}{N_c},$$

where $w = \log_2(\text{memory size})$. For the 360, w is assumed to be 32 bits.

In  Chapter  4 a more  general  formula  for  B will  be  developed,  which  is 
compatible  with  a broader  class  of  architectures. 

The next section introduces the concept of entropy and defines
the variable H. It is then shown that this variable is a lower bound for
A and consequently A vs. H provides a measure of addressing efficiency of
the system architecture operating on typical object code compiled for the
system.


2.3.  Entropy 

Entropy  is  used  by  information  theorists  as  a measure  of  the 
uncertainty  of  a random  variable. 

Definition 2.3.1: The entropy, H(S), of a random variable, S, which
arises from an unconditional probabilistic process is the average self
information:

    $$H(S) = -\sum_i p(s_i) \log p(s_i),$$

where the self information of the occurrence of element $s_i$ of S is defined
as $-\log p(s_i)$, the negative logarithm of the probability of occurrence of
element $s_i$. □

Thus, for example, if S takes on one of three values with
probability 1/2, 1/4, and 1/4, the information conveyed by the occurrence
of each of these values is 1 bit, 2 bits, and 2 bits, respectively.
The expected information conveyed by a value of S, i.e., the average self
information of S, is 1.5 bits. A Huffman encoding of the three values which
transmits no more bits than the information content is 0, 10, and 11,
respectively.
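
The three-value example can be checked numerically. The following Python sketch computes the self information of each value, the entropy of Definition 2.3.1, and the average length of the code 0, 10, 11; the function name `entropy` is ours.

```python
import math

def entropy(probs):
    """Average self information, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]
codes = ["0", "10", "11"]      # the code quoted in the text

self_info = [-math.log2(p) for p in probs]                 # bits per value
avg_len = sum(p * len(c) for p, c in zip(probs, codes))    # bits transmitted

print(self_info)
print(entropy(probs))
print(avg_len)
```

Because every probability here is a power of 1/2, the average code length equals the entropy exactly; for other distributions a Huffman code can only approach it from above.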

Many processes have considerable structure, i.e., they are very
predictable given sufficiently high order conditional probabilities.
Extension to higher order requires a more involved, but more useful
definition of entropy. Let $m_j \in M$, where M is the set of memory locations
accessible by a program and $|M| = 2^w$, i.e., the memory size. Let S be
the random variable that represents the memory referencing process. S
takes on values which are successively referenced memory addresses, in a
probabilistic fashion.

Definition 2.3.2: An n gram, $g_j^n$, is an ordered set of n addresses; e.g.,
if we have two locations a and b, all possible 2 grams would be $g_1^2 = aa$,
$g_2^2 = ab$, $g_3^2 = ba$, $g_4^2 = bb$. $G^n$ is the set of all possible n-grams. □

Using the n gram entropies of Shannon [5] the absolute entropy
is defined.

Definition 2.3.3: Let $m_i$ be an arbitrary reference to the ith location in
memory. Let $g_j^{n-1}$ be the n-1 gram of the previous n-1 references. The
zeroth order entropy is

    $$H_0(S) = \log_2 |M| = w,$$

if all $2^w$ locations of memory are accessed by the program. Note that $H_0(S)$
actually represents the value $\log_2 |M'|$, where M' is the set of referenced
memory locations, $M' \subseteq M$. All locations of memory are not always
referenced, so in general

    $$H_0(S) = \log_2 |M'| \le \log_2 |M|.$$

The nth order entropy is

    $$H_n(S) = -\sum_{i,j} p(m_i, g_j^{n-1}) \log p(m_i \mid g_j^{n-1}),$$

and the absolute entropy is

    $$H(S) = \lim_{n \to \infty} H_n(S).$$ □

It should be noted that calculation of H(S) for a finite program
trace is impossible, since as n approaches the length of the trace, $H_n(S)$
approaches 0 and is no longer meaningful. Therefore, the variable $\hat{H}$ will
be introduced, which is an estimate of H based on the general structure,
but not on complete knowledge of the trace.

Note that we are using base 2 logarithms so that entropy has
dimension bits/reference. For example, with the above definition

    $$H_1(S) = -\sum_{m_j \in M} p(m_j) \log p(m_j),$$

    $$H_2(S) = -\sum_{m_{j_1} \in M} \; \sum_{m_{j_2} \in M} p(m_{j_1}, m_{j_2}) \log p(m_{j_1} \mid m_{j_2}), \text{ and}$$

    $$w \ge H_0(S) \ge H_1(S) \ge \cdots \ge H_\infty(S) = H(S) \ge 0.$$

Note that the set of all 1 grams is M. Entropy is an excellent measure of the
unpredictability of a probabilistic process. If S represents an independent
uniformly distributed random variable, H(S) = w, and if for some i,
$p(m_i) = 1$ (a particular location, i, is always referenced), H(S) = 0,
bounding H(S). Entropy then is a measure of the "uncertainty" or
"unpredictability" of the random variable S; nth order entropy is a measure
of the unpredictability of the next reference given knowledge of the
previous n-1 references. It is important that entropy is a memory
allocation independent measure of predictability, in contrast to locality,
which is allocation dependent.
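
As an illustration of these definitions, the nth order entropy of a short reference trace can be estimated from n-gram counts. The Python sketch below is a plug-in estimate over a made-up trace, not the estimation method developed later in this chapter; the function name, the toy addresses, and the relabeling map are all assumptions of the example.

```python
import math
from collections import Counter

def nth_order_entropy(trace, n):
    """Plug-in estimate of H_n in bits/reference from n-gram counts.

    H_n = -sum p(m, g) * log2 p(m | g), where g is the (n-1)-gram of
    references preceding m. For n = 1 the context g is empty.
    """
    joint = Counter()     # counts of (context, next reference)
    context = Counter()   # counts of each context
    for k in range(n - 1, len(trace)):
        g = tuple(trace[k - n + 1:k])
        joint[(g, trace[k])] += 1
        context[g] += 1
    total = sum(joint.values())
    return -sum((c / total) * math.log2(c / context[g])
                for (g, _), c in joint.items())

trace = [100, 104, 108, 100, 104, 108, 100, 104, 108, 100]   # a tight loop
h0 = math.log2(len(set(trace)))      # log2 of the referenced locations
h1 = nth_order_entropy(trace, 1)
h2 = nth_order_entropy(trace, 2)

# Renaming the locations (a different memory allocation) leaves H_n unchanged.
relabel = {100: 7, 104: 9000, 108: 42}
moved = [relabel[m] for m in trace]
h1_moved = nth_order_entropy(moved, 1)

print(h0, h1, h2, h1_moved)
```

For this loop the first order entropy is close to $\log_2 3$, while the second order entropy is zero because each reference is fully determined by its predecessor. Relabeling the addresses changes the locality of the trace but not its entropy.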

Before  presenting  the  key  result  of  this  section,  it  is  useful 
to  describe  the  abstract  model  of  the  addressing  process.  When  a program 
is  being  executed  by  a CPU,  all  or  part  of  some  of  the  words  fetched  by 
the  CPU  constitute  input  to  the  addressing  process.  That  process  as  it  is 
executed  under  program  control  by  the  hardware  can  be  modeled  as  a finite- 
state  machine.  This  machine  accepts  as  input  the  data  words,  instructions, 
and  portions  of  instructions  which  we  defined  as  the  addressing  process 
input  in  the  previous  section.  In  an  actual  system  these  codewords  flow 
into  the  CPU  over  the  data  bus.  This  same  bus  also  transmits  the 
computation  process  inputs.  Hence  the  variable  length  property  of  the 
input  codes  is  essential  even  though  the  bus  itself  and  the  data  words 
on  it  may  have  constant  width.  The  output  of  the  finite  state  machine 
representing  the  addressing  process  is  the  actual  string  of  memory 
locations  (outputs  to  the  address  bus)  referenced  by  the  computation 
process  only.  The  state  of  this  machine  may  be  thought  of  as  the  cross 
product  of  the  states  of  all  registers  (or  partial  registers)  used  by  the 
addressing  process. 

At this point, the addressing process is abstractly thought of
as generating only the references for the computation process. This means
that for this section, the variable A represents just that overhead
required to generate the computation process memory reference stream.
Of course the actual values for A in the results section will also include
some overhead required by the addressing process for generating its own
references. This makes A only slightly larger than if such overhead were
excluded.

Since our decoder is a finite-state machine, it can
be thought of as a decoder which uses a finite number of different
codebooks. The current codebook being used is determined by the state
of the machine. Codebooks can be changed by certain codewords which are
state dependent. Different codebooks then map the same input code set
onto the same output set in different ways. What this means is that,
depending on the initial state of the machine, a finite length string of
codewords input into the machine could give two or more distinct output
strings.

Therefore  we  have  a finite-state  decoder  accepting  an  input 
code  set  of  variable  length  codewords,  and  outputting  elements  from  the 
set  of  memory  addresses  (in  other  words,  the  memory  reference  stream  of 
the  computation  process).  The  variable-length  input  codeword  set  must  be 
uniquely decodable. A code is uniquely decodable if for each source sequence
of finite length, the sequence of code letters corresponding to that source
sequence is different from the sequence of code letters corresponding to
any other source sequence [6].

Recapitulating  then,  H(S)  is  the  absolute  entropy  of  the  stream 
of  computation  references,  i.e.,  machine  outputs,  which  are  the  memory 
references required by the computation process. A is the average length
in  bits  of  the  codewords  input  into  our  finite-state  "addressing"  machine 
per  computation  reference. 




Figure  2 illustrates  the  entire  process.  In  Figure  2a,  a 
precise  diagram  of  the  relationship  of  the  addressing  process  to  the 
computation  process  is  shown.  Figure  2b  shows  the  abstract  model  of  the 
addressing  "communication  channel."  The  decoder  is  the  finite-state 
machine  model  of  the  addressing  process.  A represents,  in  bits/symbol, 
a code  for  selecting  codebooks  and  inputs  to  the  selected  codebook  which 
cause  the  addressing  process  to  generate  the  S string.  The  encoder  is 
then an imaginary process which codes S into an efficient A string. A
similar  encoding  function  is  normally  performed  by  the  compiler. 

Theorem 2.3.1: If a finite state deterministic decoder is used to convert
A to B, then

    $$A \ge H(S).$$ □

Before  Theorem  2.3.1  can  be  proven  two  useful  lemmas  and  a 
theorem  are  given. 

Lemma 2.3.1: Let $p_1, p_2, \ldots, p_M$ and $q_1, q_2, \ldots, q_M$ be arbitrary positive
numbers with $\sum_{i=1}^{M} p_i = \sum_{i=1}^{M} q_i = 1$. Then

    $$-\sum_{i=1}^{M} p_i \log p_i \le -\sum_{i=1}^{M} p_i \log q_i,$$

with equality if and only if $p_i = q_i \; \forall i$.

Proof is given in Ash [7]. □

Lemma 2.3.2: For any uniquely decipherable binary code with word lengths
$n_1, n_2, \ldots, n_M$,

    $$\sum_{i=1}^{M} 2^{-n_i} \le 1.$$

Proof is given in Ash [8]. □
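
Both lemmas are easy to check numerically on a small case. The Python sketch below uses an arbitrary four-symbol distribution and prefix code chosen for illustration; the helper names are ours.

```python
import math

def cross_entropy(p, q):
    """-sum p_i log2 q_i; with q = p this is the entropy of p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def kraft_sum(lengths):
    """Left-hand side of Lemma 2.3.2 for the given code word lengths."""
    return sum(2.0 ** -n for n in lengths)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
prefix_code = ["0", "10", "110", "111"]
lengths = [len(c) for c in prefix_code]

lhs = cross_entropy(p, p)    # -sum p_i log p_i
rhs = cross_entropy(p, q)    # -sum p_i log q_i
k = kraft_sum(lengths)

print(lhs, rhs, lhs <= rhs)  # Lemma 2.3.1
print(k, k <= 1)             # Lemma 2.3.2
print(kraft_sum([1, 1, 2]))  # exceeds 1: no uniquely decipherable code has these lengths
```

The prefix code meets the bound of Lemma 2.3.2 with equality, which is the situation in which no code word could be shortened without losing unique decipherability.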

Theorem 2.3.2: Let S be a random variable which represents a source
generating symbols. Let $A_n$ be the average number of bits
required to code each source output in binary digits, given that
the encoder and decoder have only knowledge of the previous n-1 symbols
generated. $A_n$ is then the average code length needed to code the next
symbol, knowing the n-1 gram, $g_j^{n-1}$, of previous symbols. Then

    $$A_n \ge H_n(S).$$

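Proof: (Similar to that of Theorem 2.5.1 in Ash [9].) Define

    $$A_n = \sum_{i,j} p(m_i, g_j^{n-1}) \, n_{i|j},$$

where $n_{i|j}$ is the length of the code word assigned to $m_i$ when the n-1
gram $g_j^{n-1}$ has just been transmitted. Then

    $$A_n = \sum_j p(g_j^{n-1}) \, A_n(S \mid g_j^{n-1}),$$

where

    $$A_n(S \mid g_j^{n-1}) = \sum_i p(m_i \mid g_j^{n-1}) \, n_{i|j}.$$

Now

    $$H_n(S) = -\sum_{i,j} p(m_i, g_j^{n-1}) \log p(m_i \mid g_j^{n-1}),$$

or

    $$H_n(S) = \sum_j p(g_j^{n-1}) \, H_n(S \mid g_j^{n-1}),$$

where

    $$H_n(S \mid g_j^{n-1}) = -\sum_i p(m_i \mid g_j^{n-1}) \log p(m_i \mid g_j^{n-1}).$$

First we shall prove

    $$A_n(S \mid g_j^{n-1}) \ge H_n(S \mid g_j^{n-1}).$$

Multiplying $A_n(S \mid g_j^{n-1})$ by $\log_2 2$ (= 1) we have

    $$(\log_2 2) \, A_n(S \mid g_j^{n-1}) = -\sum_i p(m_i \mid g_j^{n-1}) \log_2 2^{-n_{i|j}}.$$

Letting $q_i = 2^{-n_{i|j}} / \sum_k 2^{-n_{k|j}}$, by Lemma 2.3.1 we have

    $$-\sum_i p(m_i \mid g_j^{n-1}) \log_2 p(m_i \mid g_j^{n-1}) \le -\sum_i p(m_i \mid g_j^{n-1}) \log_2 q_i,$$

    $$H_n(S \mid g_j^{n-1}) \le -\sum_i p(m_i \mid g_j^{n-1}) \log_2 \left( \frac{2^{-n_{i|j}}}{\sum_k 2^{-n_{k|j}}} \right),$$

    $$H_n(S \mid g_j^{n-1}) \le -\sum_i p(m_i \mid g_j^{n-1}) \log_2 2^{-n_{i|j}} + \sum_i p(m_i \mid g_j^{n-1}) \log_2 \left( \sum_k 2^{-n_{k|j}} \right),$$

    $$H_n(S \mid g_j^{n-1}) \le A_n(S \mid g_j^{n-1}) + \log_2 \left( \sum_k 2^{-n_{k|j}} \right),$$

since

    $$\sum_i p(m_i \mid g_j^{n-1}) = 1.$$

By Lemma 2.3.2, $\sum_k 2^{-n_{k|j}} \le 1$ for any uniquely decipherable code, then

    $$\log_2 \left( \sum_k 2^{-n_{k|j}} \right) \le 0,$$

since

    $$\log_2 y \le 0 \text{ for } 0 < y \le 1.$$

Therefore

    $$A_n(S \mid g_j^{n-1}) \ge H_n(S \mid g_j^{n-1}).$$

Since the inequalities hold between $A_n(S \mid g_j^{n-1})$ and $H_n(S \mid g_j^{n-1})$ for all j,
by multiplying both sides of the inequality by $p(g_j^{n-1})$ and summing over j
we get

    $$\sum_j p(g_j^{n-1}) \, A_n(S \mid g_j^{n-1}) \ge \sum_j p(g_j^{n-1}) \, H_n(S \mid g_j^{n-1}),$$

    $$A_n \ge H_n(S).$$ □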

Basically the above theorem is applicable whenever the encoder
and decoder know only the previous n-1 symbols. Since CPUs use registers
to remember addresses and operands, the CPU can pass information forward
indefinitely, beyond any finite n gram that may be stored. Hence the
concept of infinite memory encoding (unbounded memory) is necessary to
allow modeling of the memory of unbounded history brought forward by the
contents of the CPU addressing registers. This CPU behavior is precisely
an infinite memory coding situation, since the same finite, arbitrarily
long subsequence of instructions can refer to two or more distinct locations.

When  a process  has  tightly  coupled  codeword  groupings,  it  is 
possible  to  beat  the  standard  Huffman  encoding  for  the  entire  ensemble 
by  Huffman  encoding  each  group  and  changing  codes  when  a new  group  is 
entered.  This  technique,  of  course,  requires  a decoder  with  memory  and 
intelligence. This process is exactly what occurs in an R+D architecture
when  base  registers  are  loaded  to  cover  current  localities.  For  Theorem 
2.3.2  to  apply  to  our  finite-state  machine  then  we  must  let  n go  to 
infinity  as  stated  in  the  following  corollary,  thus  proving  Theorem  2.3.1. 

Corollary 2.3.1: For an infinite memory coding system

    $$A \ge \lim_{n \to \infty} H_n(S) = H(S).$$ □

No  proof  is  given  since  this  is  the  most  fundamental  result  of  information 
theory,  i.e.,  one  cannot  code  at  a rate  which  is  lower  than  the  absolute 
entropy  of  a probabilistic  source.  In  a sense  we  have  let  n go  beyond 
any past knowledge the coder may bring forward.

Corollary 2.3.1 implies that a program must be initial state
independent, i.e., it should assume nothing of the initial register
contents (reloading before using). This stems from the fact that the
coder must always be able to operate correctly by looking at the last
n references as n goes arbitrarily large. This assumption is not
unrealistic since all computer systems require initial register contents
to be set up for a program by each program itself or by the operating
system each time a program is activated.

H(S) is thus a lower bound for A, making A vs. H(S) a good
measure of addressing efficiency. Though H(S) is upper bounded by w, A
is not upper bounded and could be infinite. (If for example no computation
is occurring at all, $N_c = 0$.) Although in practice A is much larger than
H(S), one can visualize a machine where A = H(S) = w. Basically, if we
have H(S) = w, a random referencing process, the most efficient addressing
architecture would be for the addressing process to fetch the next address
directly from the memory each time, making A = w. So even though A can
be infinite, A should not be greater than w in a reasonable system.

If the coding process has infinite memory, we cannot
guarantee $A_n \ge H_n(S)$ for any finite n. This is due to the fact that the
CPU can always have some knowledge of past behavior (>n away) which can
be used to code more efficiently than $H_n(S)$, since $H_n(S)$ assumes no
knowledge beyond the last n-1 gram of references. It is, however,
interesting  that  with  real  programs  an  R+D  architecture  has  difficulty 
coding  much  better  than  (Chapter  3).  Therefore  such  an  architecture 
does  not  make  effective  use  of  higher  order  information. 

The calculation of $H_\infty(S)$ from a finite trace is of course
impossible. The next section presents an algorithmic approach to computing
$\hat{H}(S)$, an estimate of H(S).

2.4. Estimation Theory

Since actual computer programs are not infinite processes, a
method is needed for finding an estimated entropy $\hat{H}(S)$ of an actual
reference stream. The following method has been used to find $\hat{H}(S)$ for
IBM 360 program traces. This method tries to take advantage of the
"structure" of the program. Basically a model of program execution is
derived using this structure, with the entropy $\hat{H}(S)$ being the entropy of
the model.

Note that the $\hat{H}(S)$ that is calculated from these traces is,
like the traces themselves, data dependent. It is precisely valid only
for a particular run of a particular program.

Initially, a computation reference trace is created from a
filtered 360 instruction execution trace. Given the reference trace, the
following constructs are used to create a block graph of program execution:

Definition 2.4.1: A Ramp is a set of sequentially executed, nonbranching
instructions terminated by a branch instruction (conditional or
unconditional). A ramp may be entered only at the first instruction. □

Definition 2.4.2: A Block is a set of one or more sequentially executed
ramps where all entry points to the block are to the beginning of the
first ramp in the block and all exit points from the block only leave the
last ramp in the block; see Figure 3. □

Blocks  and  ramps  are  totally  sequential,  i.e.,  no  looping  is 
allowed  within  them. 


Blocks actually could be constructed from a trace in one pass,
but it has been found that a two pass (ramps, then blocks) distillation
process works much more smoothly. This is true primarily because of
the large number of references one must keep track of when creating
blocks directly from the reference trace.

One problem that can occur is that a ramp can have an arc
entering in the middle. SUPER, the program that determines the blocks
from the ramp graph, simply moves such an arc to the beginning of the ramp
rather than split the ramp. This, however, occurred for only two ramps
within one program trace (LIST) and therefore hardly affected the accuracy
of $\hat{H}(S)$.

A block graph is now created (see Figure 4) where each arc from
node i to node j is labelled $f_{ij}$, representing the execution frequency of
that arc. Note that

\[ \sum_j f_{ij} = F_i, \]

the total execution frequency of node i. Also an arc is added from the
last block executed to the first executed with $f_{\mathrm{last},\mathrm{first}} = 1$.
This construction results in an irreducible Markov process, i.e., one in
which any state can be reached from any other. The node transition
probabilities can be estimated from the arc frequencies. Since the limiting
value of $f_{ij}/F_i$ for a long trace is the transition probability from
node i to node j,

\[ p_{ij} = \frac{f_{ij}}{F_i} \]

is an estimated transition probability.
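The estimate amounts to normalizing the outgoing arc counts of each node. A minimal Python sketch, assuming the executed blocks are available as a plain list of block numbers (the function name and the input format are illustrative and are not taken from the original analysis programs):

```python
from collections import Counter, defaultdict


def transition_probabilities(block_sequence):
    """Estimate p_ij = f_ij / F_i from the order in which blocks executed."""
    # f_ij: how often block i was immediately followed by block j.
    arc_freq = Counter(zip(block_sequence, block_sequence[1:]))
    # Closing arc from the last block executed back to the first, frequency 1.
    arc_freq[(block_sequence[-1], block_sequence[0])] += 1

    # F_i: total outgoing frequency of node i (sum of f_ij over j).
    node_freq = Counter()
    for (i, _j), f in arc_freq.items():
        node_freq[i] += f

    probs = defaultdict(dict)
    for (i, j), f in arc_freq.items():
        probs[i][j] = f / node_freq[i]
    return dict(probs)


example = transition_probabilities([1, 2, 1, 3, 1, 2])
```

In this toy sequence block 1 is followed by block 2 twice and by block 3 once, so its outgoing probabilities come out as 2/3 and 1/3, and each row sums to one.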

To get a rough idea of the error introduced by such finite
sampling, the expected first order entropy calculation has been shown by
Basharin [10] to be

\[ E[\hat{H}_1(S)] = H_1(S) - \frac{q-1}{2N}\log_2 e + O\!\left(\frac{1}{N^2}\right) \]

where q = number of symbols, |M|, and
N = sample size, the number of references in the trace.

For our traces, typically q ≈ 1100 and N ≈ 80,000, which gives an error
of about 2%. As n increases in $H_n(S)$, the finite sampling error also
increases. This effect poses an upper limit on the order of H which can be
established by this method.
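For a feel of the magnitudes involved, the leading bias term can be evaluated directly. The sketch below assumes the usual form of that term, (q - 1)/(2N) log2 e bits; the function name is illustrative, and the result is an absolute bias in bits, so how it maps to a percentage depends on the entropy value it is compared against.

```python
import math


def basharin_bias_bits(q, n_samples):
    """Leading finite-sample bias term, (q - 1) / (2N) * log2(e), in bits."""
    return (q - 1) / (2 * n_samples) * math.log2(math.e)


# Typical trace parameters quoted in the text.
bias = basharin_bias_bits(q=1100, n_samples=80_000)
print(f"bias is roughly {bias:.4f} bits per reference")
```

With q = 1100 and N = 80,000 the term is close to 0.01 bit.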

After estimating the block transition probabilities $p_{ij}$ we can
easily compute the stationary probabilities $q_i$, i.e., the probability of
entering state i on an arbitrary state transition, for each block. These
exist since the process is irreducible and ergodic. The process is
ergodic, i.e., it reaches a unique steady state, since in all our programs
there has existed at least one block with an arc to itself. This is a
sufficient, though not a necessary, condition of ergodicity [11]. Each
block is now expanded as in Figure 5 to include both its instruction and
data references. Again each arc is labelled with its execution frequency.
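The stationary probabilities of the block graph can be obtained in several ways. The sketch below uses simple power iteration, which is a choice made here for brevity and not necessarily the method used in the original programs; the three-state chain is a hypothetical example with a self-loop on state 1.

```python
def stationary_distribution(p, iterations=10_000, tol=1e-12):
    """Power iteration; p[i][j] is the transition probability from i to j."""
    states = sorted(p)
    q = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in states}
        for i in states:
            for j, pij in p[i].items():
                nxt[j] += q[i] * pij
        # Stop once successive iterates agree to within the tolerance.
        if max(abs(nxt[s] - q[s]) for s in states) < tol:
            return nxt
        q = nxt
    return q


# Hypothetical chain: state 1 loops to itself or moves to 2 or 3,
# and states 2 and 3 always return to 1.
chain = {1: {1: 0.25, 2: 0.25, 3: 0.5}, 2: {1: 1.0}, 3: {1: 1.0}}
q = stationary_distribution(chain)
```

For this chain the balance equations give q = (4/7, 1/7, 2/7). The self-loop makes the chain aperiodic, which is what lets the iteration settle.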
Definition 2.4.3: A Program Graph is an expanded block graph, in which
each node represents a single reference. □

Definition 2.4.4: Let $\hat{H}(S)$ be the information content of the program
graph as defined above. □

It should be mentioned that $\hat{H}(S)$, though not computed by letting
n approach infinity in $H_n(S)$, does nevertheless represent the uncertainty,
given that we know only the program graph of the program run. It does
not, however, trace block execution "histories," since such knowledge,
though it would reduce $\hat{H}(S)$, would be expensive and unwieldy to obtain,
especially considering the necessity of knowing the multitude of data
reference "histories." Including such information also runs the risk of
incurring unacceptably large finite sampling error. Consequently, $\hat{H}(S)$
represents a good engineering estimate of H(S).

Within a single block all data and instruction references are
dependent on the previous instruction reference. All sequentially
accessed instructions then contribute nothing to the entropy, since given
the first instruction of the block, they are known with absolute
certainty. Data references are also dependent only on the previous
instruction and are independent of each other. This assumption adds more
uncertainty than is perhaps justified by the model. However, it was found
through study of one of the traces (GAUSS) that blocks which had two or
more instructions that referenced data from two or more distinct locations
were quite rare. Incidentally, by assuming data reference independence we
also eliminate any highly correlated data referencing behavior that might
occur between successive executions of a block, such as in a loop.
Consequently data references may be made in a sequential manner from pass
to pass within the block but will appear to be independent in the program
graph. This again overestimates the uncertainty, but is necessary for a
reasonable computation of $\hat{H}(S)$.

The absolute entropy $\hat{H}(S)$ of a program graph is given by the
following theorem:

Theorem 2.4.1: For a program graph the estimated information content per
reference with independent data references is

\[ \hat{H}(S) = \sum_i \frac{r_i}{n_i} H(s \mid i) = \frac{1}{N} \sum_i q_i H(s \mid i) \]

where $n_i$ is the total number of references in block i, $N = \sum_j n_j q_j$,

\[ r_i = \frac{n_i q_i}{\sum_j n_j q_j} \]

is the fraction of total references which are generated by block i, and
$H(s \mid i)$ is the total information content over all references in block i,

\[ H(s \mid i) = -\sum_j p_{ij} \log_2 p_{ij} + \sum_k H_d(i,k) \]

where the $p_{ij}$ are the estimated block transition probabilities, k spans
the data references in block i, and

\[ H_d(i,k) = -\sum_\ell p(i,k,\ell) \log_2 p(i,k,\ell) \]

where $\ell$ spans all nodes associated with data reference k in block i and
$p(i,k,\ell)$ is the estimated probability of occurrence of node $\ell$ during
data reference k in block i. □

Before  proving  Theorem  2.4.1  the  following  lemma  is  needed. 

Lemma 2.4.1: The entropy per symbol output by a Markov source is given by

\[ H_\infty(S) = \sum_i q_i H(s \mid i) \]

where

\[ H(s \mid i) = -\sum_j p_{ij} \log_2 p_{ij}, \]

and $q_i$ is the Markov stationary probability for state i, which is well
defined since the program graph has only one irreducible set of states,
which is ergodic. The proof is given in Gallager [12]. □


Proof of Theorem 2.4.1: To evaluate $\hat{H}(S)$, let us consider each node
(reference) in the program graph, and find its entropy. The possible
cases are:

i) First instruction in a block: This entropy is based on the first
order block transition, i → j, without considering, e.g., where block
i went last time or which block preceded i. The estimated uncertainty
of the transition from block i to j is

\[ H_{\text{Block } i \to j} = -\log_2 p_{ij}. \]

Associating this H term with the end of block i and accumulating
and weighting by $p_{ij}$ over all j (successor blocks of block i) we get

\[ H_{\text{Block } i \text{ Exit Transition}} = -\sum_j p_{ij} \log_2 p_{ij}. \]

ii) Other instructions of block: Given that we are in block i, these
other instructions contribute nothing to uncertainty, since they are
fetched sequentially, therefore

\[ H_{\text{Instructions Within Block } i} = 0. \]

iii) Data reference at data reference position k in block i: Only consider
dependency on previous instruction fetched (equivalent to knowing k);
e.g., do not consider instruction references before that, any other
data references in the block, or the data element referenced at
position k in block i during the last execution of block i. Of all
the locations referenced at this position throughout the trace,
consider the one numbered $\ell$, then

\[ H_{\text{Data Reference } \ell \text{ at Position } k \text{ in Block } i} = -\log_2 p(i,k,\ell). \]

Accumulating and weighting by $p(i,k,\ell)$ over $\ell$ we have then the
entropy associated with the data reference at position k in block i

\[ H_d(i,k) = -\sum_\ell p(i,k,\ell) \log_2 p(i,k,\ell). \]

Grouping terms, the entropy associated with block i is

\[ H(s \mid i) = \underbrace{-\sum_j p_{ij} \log_2 p_{ij}}_{\substack{\text{Block Transition} \\ \text{Associated with} \\ \text{Exit of Block } i}} + \underbrace{\sum_k H_d(i,k)}_{\substack{\text{Data Reference} \\ \text{Entropy within} \\ \text{Block } i}}. \]

The entropy per reference for block i is $H(s \mid i)/n_i$, where $n_i$ is the
total number of references in block i.

Recall that $q_i$ is the estimated stationary probability of
entering block i, given that a block transition is being made. However,
$n_i$ is not constant over i. Therefore define $r_i$ as the stationary
probability of a particular reference being from block i; i.e.,

\[ r_i = \frac{n_i q_i}{N}, \quad \text{where } N = \sum_j n_j q_j. \]

Using Lemma 2.4.1 with $r_i$ as the stationary probability and
$H(s \mid i)/n_i$ as the entropy per reference, we have

\[ \hat{H}(S) = \sum_i \frac{r_i}{n_i} H(s \mid i). \]

However, since $r_i = n_i q_i / N$,

\[ \hat{H}(S) = \frac{1}{N} \sum_i q_i H(s \mid i). \]

□
The above theorem states that by knowing the number of
references $n_i$ in block i and the probability of being in block i, it is
merely a matter of summing the data entropy contributions and the
contribution to entropy made when leaving the block and then multiplying by
$r_i/n_i$ to get the total entropy contribution per reference made by block i
to the absolute entropy.

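Figure 6 shows an example of a program graph and the application
of Theorem 2.4.1 to it. Let us assume for brevity that the contribution
by data references in blocks 2 and 3 is zero. The following calculations
can then be made. The $q_i$ can be computed from steady state Markov
analysis of Figure 6a.

\[ N = q_1 n_1 + q_2 n_2 + q_3 n_3 = \tfrac{4}{7}(7) + \tfrac{1}{7}(4) + \tfrac{2}{7}(12) = 8, \]

\[ H_d(1,1) = 0, \]

\[ H_d(1,2) = -\left(\tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{3}{4}\log_2\tfrac{3}{4}\right) = 1.06, \]

\[ H_d(1,3) = -\left(\tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1.5, \]

\[ H(s \mid 1) = -\left(\tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) + 0 + 1.5 + 1.06 = 4.06, \]

\[ H(s \mid 2) = 0, \]

\[ H(s \mid 3) = 0, \]

\[ \hat{H}(S) = \tfrac{1}{8}\left(\left(\tfrac{4}{7}\right)4.06 + \left(\tfrac{1}{7}\right)0 + \left(\tfrac{2}{7}\right)0\right), \]

\[ \hat{H}(S) = 0.29. \]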
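The calculation in Theorem 2.4.1 is mechanical enough to sketch in a few lines of Python. The function and argument names below are illustrative. The probabilities in the usage example are assumptions chosen to reproduce the figures quoted in the example (1.06, 1.5, 4.06 and 0.29), since Figure 6 itself is not reproduced here.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of one discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def estimated_entropy(q, n, transitions, data_refs):
    """Entropy per reference: (1/N) * sum_i q_i * H(s|i), N = sum_i n_i q_i.

    q[i]           stationary probability of block i
    n[i]           number of references in block i
    transitions[i] list of exit probabilities p_ij of block i
    data_refs[i]   one address distribution per data reference position k
    """
    big_n = sum(n[i] * q[i] for i in q)
    total = 0.0
    for i in q:
        h_block = entropy_bits(transitions[i])
        h_block += sum(entropy_bits(dist) for dist in data_refs[i])
        total += q[i] * h_block
    return total / big_n


# Assumed values consistent with the worked example.
q = {1: 4 / 7, 2: 1 / 7, 3: 2 / 7}
n = {1: 7, 2: 4, 3: 12}
transitions = {1: [0.25, 0.25, 0.5], 2: [1.0], 3: [1.0]}
data_refs = {
    1: [[1.0], [1 / 8, 1 / 8, 3 / 4], [1 / 4, 1 / 4, 1 / 2]],
    2: [],
    3: [],
}
h_hat = estimated_entropy(q, n, transitions, data_refs)
```

With these inputs N = 8 and the result rounds to 0.29 bits per reference.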
2.5. Analysis Programs

We now describe the system of programs which compute A, $\hat{H}(S)$,
$H_0(S)$, $H_1(S)$, and $H_2(S)$ (see Figure 7). Incidentally, no higher n-gram
entropy than $H_2(S)$ was computed due to the large sampling error that would
occur. For example, applying Basharin's formula [10] to $H_2(S)$ (for GAUSS),
a sampling error of over 15% would be incurred.

The programs traced are discussed in Section 2.6. Basically
the traces themselves are lists of the instructions executed by a
particular program when running on an IBM 360 architecture. Each element
in the trace contains the hexadecimal code for the instruction executed,
any memory references made by the instructions, and for branches, the
branch destination. (Note: With the MOVE or other multiple reference
instructions, only the first location is given by the trace, since the
number of locations referenced is given in the instruction.) For
conditional branch instructions the destination is also given according to
whether the branch is taken.

FILT1 takes an instruction trace and reverses it so that FILT2
can scan it in a reverse direction.

FILT2 takes the reversed trace and applies to it the filter
algorithm given in Section 2.2. The statistics computed by FILT2 are
given in Table 3. The output is the reversed computation trace, all
addressing instructions having been deleted. The total number of bits
deleted (addressing instructions, their references, and addressing portions
of computation instructions) is passed to PREP, which when executed will
use this information to compute A after calculating the number of
computation references.

Table 3

Statistics Computed by FILT2

Total Instructions Read
Instructions Deleted
Index Instructions Deleted
Base Instructions Deleted
Average Number of Active Addressing Registers
Maximum Number of Active Addressing Registers
Active Register Deletions
Number of Index Register Loads (Register Reference Instructions)
Number of Base Register Loads (Register Reference Instructions)
Number of Index Register Loads (Memory Reference Instructions)
Number of Base Register Loads (Memory Reference Instructions)
Number of Register Reference Instructions in Filtered Trace
Number of Memory Reference Instructions in Filtered Trace
Number of SS Type Instructions in Filtered Trace
Number of RS and SI Type Instructions in Filtered Trace
Arithmetic Instruction Deletions:
    Register Reference Instructions Operating on Index Registers
    Register Reference Instructions Operating on Base Registers
    Memory Reference Instructions Operating on Index Registers
    Memory Reference Instructions Operating on Base Registers
Number of Stores from Activated Registers
Number of Load Multiple Instructions

FILT3  then  reverses  the  resulting  computation  trace  so  that  it 
can  be  scanned  in  the  forward  direction. 

PREP  takes  the  computation  trace  and,  by  simulating  the  operation 
of  an  IBM  360  CPU,  creates  a file  of  all  memory  references  to  32  bit 
words.  (Since  the  actual  references  exist  on  the  trace,  information 
regarding  the  addressing  process  is  not  needed  by  PREP.) 

All  instructions  for  which  the  number  of  locations  referenced 
depends  on  the  data  within  those  locations  (for  example,  the  EDIT 
instruction)  are  deleted.  This  is  a reasonable  approximation, since  from 
the  trace  it  is  impossible  to  determine  what  references  are  made  and  it 
was  found  that  such  instructions  occurred  rarely  in  our  traced  programs. 

Whenever a branch occurs in the instruction stream, a special
symbol is placed after the reference for the branch in the output
reference stream. This information is used by COM and COMDAT to determine
ramp termination. Also, for each reference, a pair of values is output
by PREP. The first value is the actual location referenced in base 10,
and the second is the type of reference: instruction, data read, or data
write. PREP also computes some statistics (see Table 4) and A from the
information passed to it by FILT2.

Table 4

Statistics Computed by PREP

The Maximum Address Referenced in Memory
The Minimum Address Referenced in Memory
Number of References Output
Supervisor Calls Encountered
Total Number of Instructions Interpreted
Number of Instructions Not Interpreted
A
w

COM creates a ramp trace from the memory reference trace. Each
element in the ramp trace contains the starting address of the ramp
(unique to each ramp) and the length of the ramp in references. Data
references are not considered here, since they will not be needed to
create the block graph, and will be separately handled later.
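The ramp-trace step can be pictured with a short Python sketch. The record layout is an assumption made for illustration: each reference is an (address, kind) pair with kind "I", "R" or "W", and the special branch symbol is represented by a record of kind "B". Ramp length is counted here in instruction references only, which is one reading of "length of the ramp in references".

```python
def ramp_trace(references):
    """Return (start_address, length) for each ramp, in execution order."""
    ramps = []
    start, length = None, 0
    for address, kind in references:
        if kind == "I":                  # instruction fetch
            if start is None:
                start = address          # first instruction opens a ramp
            length += 1
        elif kind == "B":                # branch marker closes the ramp
            if start is not None:
                ramps.append((start, length))
            start, length = None, 0
        # Data reads ("R") and writes ("W") are ignored at this stage.
    if start is not None:                # trace ended without a final branch
        ramps.append((start, length))
    return ramps


refs = [
    (100, "I"), (500, "R"), (104, "I"), (None, "B"),
    (200, "I"), (None, "B"),
    (100, "I"), (104, "I"), (None, "B"),
]
ramps = ramp_trace(refs)
```

Here the ramp starting at address 100 is executed twice and the one at 200 once, so the ramp trace has three elements.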

RAMPS creates the ramp graph from the ramp trace. A hash
coded data structure is used as an associative memory (addressed by ramp
starting addresses) to build a file representing a sparse ramp transition
matrix. In the ramp transition matrix each entry (i, j) represents the
number of times the direct path from ramp i to ramp j is traversed. Also
various statistics about the ramps are collected; these are listed in
Table 5.
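In a modern language the hash coded associative memory corresponds to a dictionary keyed by ramp starting address. The following Python sketch is illustrative only (the names are not those of the original program); it builds the sparse transition counts and two of the statistics of the kind listed for RAMPS.

```python
from collections import defaultdict


def ramp_graph(ramp_starts):
    """Sparse transition counts: graph[i][j] = times ramp i led directly to j."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(ramp_starts, ramp_starts[1:]):
        graph[src][dst] += 1
    # Convert to plain dicts so later lookups cannot create empty rows.
    return {src: dict(row) for src, row in graph.items()}


def arc_statistics(graph):
    """Total number of arcs and maximum outdegree of the ramp graph."""
    outdegrees = [len(row) for row in graph.values()]
    return sum(outdegrees), max(outdegrees)


graph = ramp_graph([100, 200, 100, 300, 100, 200])
total_arcs, max_outdegree = arc_statistics(graph)
```

Only ramp pairs that actually occur take up storage, which is the point of keeping the matrix sparse.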


Table 5

Statistics Computed by RAMPS

Maximum Number of Distinct Ramps Read
Total Number of Ramps Read
Maximum Frequency of a Ramp
Average Frequency/Ramp
Maximum Outdegree/Ramp
Average Outdegree/Ramp
Total Number of Arcs in Ramp Graph

SUPER takes the ramp graph (the output from RAMPS) and determines
the blocks by the following algorithm:

Block Algorithm: Note: The Block Algorithm operates on the ramp graph.
The nodes of the graph are numbered from 1 to n. Each node J can have
an arbitrary number of successors, which point to other ramps. All nodes
have at least one successor, except the last ramp executed in the trace
which has none. The algorithm is therefore started with the first ramp
executed (J = 1) and terminated when the ramp with no successors is
encountered. The algorithm traverses all ramps at least once. Its basic
operation is to encounter a node (ramp) and stack all successors if more
than one exists, and traverse the arc leaving the node if only one exists.
When a node which has been previously traversed or has multiple successors
is encountered, the algorithm returns to the stack, obtains another arc
and begins again. Blocks constitute collections of ramps that have no
arcs entering the block except to the first ramp in the block and no arcs
leaving the block except from the last ramp. While traversing the graph,
the nodes are labeled with the current block number. Let BLOCKNUM be
the current block number. Let RAMP[I] be a pointer to the first ramp
in block I. Initially, for all I, LABEL[I] ← 0. There also exists a
stack which can have values placed on it from P by PUSHSTACK ← P and
removed from it and placed in P by P ← POPSTACK. Initially, BLOCKNUM ← 1,
RAMP[1] ← 1, and J ← 1 (a pointer to the first ramp).

Step 1: WHILE LABEL[J] = 0 (unlabeled) DO
        BEGIN
          LABEL[J] ← BLOCKNUM;
          IF J has only one successor THEN J ← successor of J
          ELSE BEGIN (If J has multiple successors a new block
                      must be started.)
            IF J has no successor THEN J ← POPSTACK; terminate
                      program when stack underflows.
            ELSE BEGIN
              PUSHSTACK ← all successors to J;
              J ← POPSTACK;
            END;
            IF LABEL[J] = 0 (unlabeled) THEN
            BEGIN (start a new block)
              BLOCKNUM ← BLOCKNUM + 1;
              RAMP[BLOCKNUM] ← J;
            END;
          END;
        END;
        PUSHSTACK ← J;
        GO TO Step 2; (Program drops through to Step 2 whenever
                       an already labeled node is encountered.)

Step 2: J ← POPSTACK;
        Search RAMP[I] list until an element I is found such that J is
        the starting ramp for block I, i.e. RAMP[I] = J;
        (If a node is not an initial ramp for a block, it and all its
        successors must be relabeled.)
        IF such an I is found THEN GO TO Step 2 (it is a complete block).
        ELSE BEGIN
          BLOCKNUM ← BLOCKNUM + 1; RAMP[BLOCKNUM] ← J;
          IF LABEL[J] = 0 (unlabeled) THEN GO TO Step 1;
          ELSE BEGIN
            K ← LABEL[J];
            WHILE LABEL[J] = K DO
            (Keep relabeling until either multiple
             successors or a different labeling, evidence
             of another incoming arc, is encountered.)
            BEGIN
              LABEL[J] ← BLOCKNUM;
              IF J has more than one successor or no
                 successors THEN GO TO Step 2
              ELSE J ← successor of J;
            END;
          END;
        END;
        GO TO Step 2.
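The block definition itself (entry only at the first ramp, exit only from the last) can also be applied directly by counting predecessors and successors. The Python sketch below is a simplified alternative formulation written for illustration; it is not the stack-based labelling procedure of SUPER, and the function name and graph encoding are assumptions.

```python
def form_blocks(successors, entry):
    """Group ramps into blocks.

    successors maps every ramp to the list of its distinct successor ramps
    (each ramp must appear as a key); entry is the first ramp executed.
    """
    predecessors = {ramp: [] for ramp in successors}
    for ramp, succs in successors.items():
        for s in succs:
            predecessors[s].append(ramp)

    def continues_block(ramp):
        # A ramp extends its predecessor's block only when that link is the
        # sole way out of the predecessor and the sole way into the ramp.
        if ramp == entry or len(predecessors[ramp]) != 1:
            return False
        (pred,) = predecessors[ramp]
        return pred != ramp and len(successors[pred]) == 1

    blocks = []
    for head in successors:
        if continues_block(head):
            continue                      # interior ramp, not a block head
        block = [head]
        while True:
            succs = successors[block[-1]]
            if len(succs) != 1 or not continues_block(succs[0]):
                break
            block.append(succs[0])
        blocks.append(block)
    return blocks


# Hypothetical ramp graph: a loop 2 -> {3, 4} -> 5 -> 6 -> 2, exit to 7.
ramp_successors = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: [6], 6: [2, 7], 7: []}
blocks = form_blocks(ramp_successors, entry=1)
```

In this graph only ramps 5 and 6 merge into one block, because ramp 6 is entered solely from ramp 5 and ramp 5 leaves solely to ramp 6.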




Table  6 

Statistics  Computed  by  MARKl 

Maximum  Number  of  Different  Blocks  Read 

Total  Number  of  Blocks  Read 

Maximum  Frequency  of  a Block 

Average  Frequency/Block 

Maximum  Outdegree/Block 

Average  Outdegree/Block 

Total  Number  of  Arcs  in  Block  Graph 
each ramp and the ramp length also contains an ordered list of the data
references made by each particular execution of each ramp.

DATAC computes the data access probabilities $p(i,k,\ell)$ (see
Theorem 2.4.1) for each block, i, in the block graph. It uses the output
file of SUPER which contains the block numbers, the starting address of
each block and the length of each block in ramps. It reads the modified
ramp trace output of COMDAT, creating a list for each reference, k, in
the block, i. It then computes for each value of i, the frequency
$f(i,k,\ell)$, which is the frequency of the $\ell$th distinct data address at
data reference (position) k within block i. The relative frequencies,
$p(i,k,\ell)$, are computed as follows:

\[ p(i,k,\ell) = \frac{f(i,k,\ell)}{F(i,k)}, \quad \text{where } F(i,k) = \sum_\ell f(i,k,\ell). \]
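As an illustration of this bookkeeping, the following Python sketch tallies addresses per (block, position) pair and converts the counts to relative frequencies. The input format, a stream of (block, position, address) triples, is an assumption made for the example and is not the COMDAT file layout.

```python
import math
from collections import Counter, defaultdict


def data_access_probabilities(observations):
    """Relative frequency of each address at each (block i, position k)."""
    freq = defaultdict(Counter)               # f(i,k,l), keyed by (i, k)
    for block, position, address in observations:
        freq[(block, position)][address] += 1
    probs = {}
    for key, counter in freq.items():
        total = sum(counter.values())         # F(i,k)
        probs[key] = {addr: f / total for addr, f in counter.items()}
    return probs


def data_entropy(distribution):
    """H_d(i,k) in bits for one (block, position) distribution."""
    return -sum(p * math.log2(p) for p in distribution.values())


# Position 0 of block 1 always touches one address; position 1 steps
# through four addresses, as an indexed access would.
obs = [(1, 0, 4000)] * 4 + [(1, 1, a) for a in (5000, 5004, 5008, 5012)]
probs = data_access_probabilities(obs)
```

The fixed address contributes no uncertainty, while the four equally likely addresses contribute 2 bits.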


ENTROP obtains the $q_i$ and $p_{ij}$ from the QPREP output file, the
$n_i$ from the SUPER output file, and the $p(i,k,\ell)$ from the DATAC output
file. ENTROP then calculates $\hat{H}(S)$ by applying Theorem 2.4.1.

H012 computes $H_0(S)$, $H_1(S)$, and $H_2(S)$ by reading the
computation process memory trace (output file of PREP). A hash coded data
structure is created, content addressable by memory address, which contains
memory locations accessed by the trace. For each element of the table
the absolute frequency, $F_i$, is kept as well as a list of successors to it
and the successor frequencies, f(i,j). Then

\[ p(i \mid j) = \frac{f(i,j)}{F_j} \quad \text{and} \quad p_i = \frac{F_i}{F}, \quad \text{where } F = \sum_j F_j. \]

So

\[ H_0(S) = \log_2(\text{number of entries in the table}), \]

\[ H_1(S) = -\sum_i p_i \log_2 p_i, \]

and

\[ H_2(S) = -\sum_{i,j} p(i,j) \log_2 p(i \mid j), \]

which, with the joint probability $p(i,j) = p(i \mid j)\,p_j$, is

\[ H_2(S) = -\sum_{i,j} p(i,j) \log_2 p(i,j) + \sum_{i,j} p(i,j) \log_2 p_j \]

or

\[ H_2(S) = -\sum_{i,j} p(i,j) \log_2 p(i,j) - H_1(S), \]

since

\[ H_1(S) = -\sum_j p_j \log_2 p_j = -\sum_{i,j} p(i,j) \log_2 p_j. \]
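These three quantities are straightforward to compute from a list of referenced addresses. The Python sketch below is illustrative; pairing the last reference with the first is an assumption made here so that the single-symbol frequencies are exactly the marginals of the pair frequencies, which lets H_2 be formed as the pair entropy minus H_1.

```python
import math
from collections import Counter


def entropy_bits(counter):
    """Entropy, in bits, of the relative frequencies in a Counter."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def low_order_entropies(addresses):
    """Return (H_0, H_1, H_2) for a sequence of referenced addresses."""
    addresses = list(addresses)
    singles = Counter(addresses)
    # Each reference paired with its successor, wrapping around at the end.
    pairs = Counter(zip(addresses, addresses[1:] + addresses[:1]))
    h0 = math.log2(len(singles))       # log2 of the number of table entries
    h1 = entropy_bits(singles)
    h2 = entropy_bits(pairs) - h1      # pair entropy minus H_1
    return h0, h1, h2


# A strictly cyclic sweep over four words: every address is equally
# likely, but each one fully determines its successor.
h0, h1, h2 = low_order_entropies([0, 4, 8, 12] * 4)
```

For the cyclic sweep H_0 and H_1 are both 2 bits while H_2 is zero, a small-scale version of the gap between first and second order entropies discussed in the results.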


The next section describes briefly the programs traced and the
results of the application of the above system of analysis programs to
their traces.


2.6.  Results 

The  results  of  applying  our  system  of  programs  to  several 
execution  traces  are  shown  in  Tables  7 and  8.  The  programs  were  traced 
using  a program,  Trace  360,  obtained  from  the  University  of  Waterloo,  on 
the  IBM  360/75  operated  by  the  Computing  Services  Office  of  the  University 
of  Illinois.  They  are  as  follows: 

GAUSS: A Gaussian elimination on a 14 × 14 matrix. This program was
compiled by the IBM FORTRAN G compiler.


Table 7

Information Content and Trace Size

                                                GAUSS    ERROR    EIGEN    LIST
Number of Computation Instructions              46,441   13,762   40,864   42,705
Number of Instructions Deleted by FILT2         15,561      239   13,138   17,350
Number of Computation References                75,546   23,306   64,169   65,825
A - Addressing Overhead
    (bits/computation reference)                  17.2     10.0     17.0     24.1
$\hat{H}(S)$ - Entropy of Computation Reference
    Stream (bits/computation reference)           1.64     .010     .495     .329
$H_I(S)$ - Instruction Fetching Component
    of $\hat{H}(S)$                               .079    .0021     .030     .040
$H_D(S)$ - Data Fetching Component
    of $\hat{H}(S)$                               1.56    .0079     .465     .289

Combined Stream
    $H_0(S)$                                     10.36    10.74    10.80    11.27
    $H_1(S)$                                      7.88     9.33     8.62     9.58
    $H_2(S)$                                      1.86     1.32     1.51     .845

Instruction Referencing Stream
    $H_0(S)$                                      9.25    10.42    10.21    10.64
    $H_1(S)$                                      6.36     9.96     8.37     9.24
    $H_2(S)$                                      .074     .006     .080     .098

Data Referencing Stream
    $H_0(S)$                                      9.35     8.43     9.24     9.77
    $H_1(S)$                                      7.86     6.40     6.47     7.58
    $H_2(S)$                                      3.12     2.53     2.71     1.90


Table 8

Ramp/Block Structure

                                              GAUSS    ERROR       --
Number of Ramps/Program                         115      143      207
Maximum Ramp Length                              46      312      185
Average Number of Data Accesses/Ramp           3.55    25.42     4.39
Average Number of Successors/Ramp              1.12     1.06     1.21
Number of Blocks/Program                         31       23       72
Average Number of Ramps/Block                  3.71     6.22     2.83
Average Number of Successors/Block             1.48     1.39     1.61
Number of Blocks Executed                     7,746      121    3,209

ERROR: An IBM 360 floating point benchmark used by the University of
Illinois Computing Services Office. This program was also compiled by
the FORTRAN G compiler.

EIGEN: A program which finds the eigenvalues of a 14 × 14 matrix using two
routines (TRED1 and TQL1) from the Eigensystem Subroutine Package
(EISPACK) of the National Activity to Test Software project.

LIST: A symbol manipulation program which builds lists from an input
stream and then traverses these lists. This program was compiled by the
IBM PL/1 compiler.

The  programs  traced  here  have  quite  different  structures,  yet 
the  main  numerical  results  were  quite  consistent  for  all  of  them.  There 
were  some  variations  from  program  to  program,  but  these  numerical 
differences  did  not  contradict  any  of  the  broad  qualitative  or  comparative 
conclusions  which  can  be  drawn  from  the  data.  Therefore  one  might  expect 
that  the  conclusions  drawn  here  are  of  general  value. 

In all cases only a portion of the total program execution was
traced, albeit a rather lengthy portion. This was done for two reasons:
first, it gives us more of a "snapshot" of program behavior over a long
enough interval to be meaningful rather than a very long term average, and
second, large amounts of computer time and storage resources are required
to compute $H_2(S)$ and $\hat{H}(S)$.

For all programs, the addressing overhead A was much larger
(at least an order of magnitude) than $\hat{H}(S)$, the information content
of the computation reference trace. This large A suggests that a good
portion of the total bit stream flowing into the CPU is devoted to
addressing. We can compute the actual percentage from the following
definition. Let

\[ A_T = \frac{\text{Total Number of Bits Input to CPU}}{\text{Number of Computation References}}. \]

$A_T$ represents all instruction and data fetches. (The data leaving the CPU
consists of data being stored and the address or reference stream.) So
we have

\[ A_T = \frac{(\text{Total Number of References in Unfiltered Trace} - \text{Number of Unfiltered Stores}) \times 32 - (\text{Number of Register Reference Instructions}) \times 16}{N_c}. \]

This variable can be computed from the original trace. For
example, LIST had $A_T$ = 41.2 bits/computation reference. This means that
A is 58.5% (($A/A_T$) × 100%) of the total number of bits flowing into the
CPU, or 58.5% of the input bandwidth for the LIST program. This is a
large value and gives a good indication of the load which the addressing
process places on the system. The small size of $\hat{H}(S)$ indicates that
considerable improvement may be possible in this area.
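As a check on the quoted percentage, taking A = 24.1 bits per computation reference for LIST from Table 7:

\[ \frac{A}{A_T} \times 100\% = \frac{24.1}{41.2} \times 100\% \approx 58.5\%. \]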

Much of the difference between A and $\hat{H}(S)$ is due to the
architectural limitations of both CPU and memory (as will be shown in
later chapters). It is not difficult to see why A is so large, since for
every memory reference, 18 bits are required to specify a mode, a register,
and a displacement, and most instructions (at least for our traces) were
memory reference instructions. Of course it is possible to reduce the
number of bits used for register select and displacement, but then
register reloading would increase (base register loading is exclusively
address processing). This trade-off is modeled in Chapter 3.

Other interesting results, in Table 8, include the small number
of blocks that exist in each program and the fact that the average number
of successors for each block is less than 2. (It can be less than 2
because a block need have only one successor: an incoming arc may split
what would otherwise be a single block.) This fact is further evidence of
considerable structure within the control flow of a program, and is
reinforced by the small contribution the block graph uncertainty, $H_I(S)$,
makes to the total uncertainty. As expected, data referencing, $H_D(S)$, is
the most significant contribution to $\hat{H}(S)$. This results primarily from
indexing (vector) operations. Consequently, with a more sophisticated model
which could capture the regular behavior of vector accessing (indexing),
smaller values of $\hat{H}(S)$ could be obtained.

Incidentally, LIST, which has a markedly different character
from the other test programs, also has significantly different generated
statistics. For example, A is significantly greater than for the other
test programs. Also, a larger number of ramps and blocks occurred in LIST.
These differences are due primarily to the symbolic character manipulation
required by LIST. Referencing under these conditions, though not
necessarily less uncertain ($\hat{H}(S)$ is not much different than for the
other traces), does require more overhead in the 360 architecture, which is
simply not as efficient in performing this type of task.


The most interesting result involves the low order n-gram
entropies. For all four programs H_0(S) and H_1(S) are about the same, with
H_1(S) being very close to H_0(S). (Note: 2^{H_0(S)} does not indicate absolute
filtered program size, but rather the number of memory words required to
hold just those instructions and data actually referenced by the
computation process during the particular trace of the program.) It is, however,
significant that not only are the values for H_2(S) fairly close to each
other for all traces, but they are significantly smaller than the
corresponding values of H_1(S). What this means is that if one has
knowledge of the second order probabilities, p(i|j), of the program, one
then has considerable knowledge of the referencing behavior as well. This
result is intuitively supported by the following observations:

i)  instructions  are  generally  fetched  sequentially, 

ii)  when  branching  does  occur  there  are  usually  only  two  possible 
targets  with  one  being  more  probable  than  the  other,  and 

iii)  most  memory  reference  instructions  always  reference  the  same 
location  each  time  the  instruction  is  executed.  (Vector  or 
indexed  addressing  occurred  relatively  infrequently  in  our 
traces,  though  it  did  add  considerable  uncertainty  when  it 
did  occur.) 
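
The low order entropies discussed above can be estimated from a trace
along the following lines. This is a minimal sketch, not the filter program
used for the tables: it assumes that H_0(S) is the logarithm of the number
of distinct words referenced, H_1(S) is the entropy of the single-reference
frequencies p(i), and H_2(S) is the conditional entropy derived from the
second order probabilities p(i|j). The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy, in bits, of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def low_order_entropies(trace):
    """Return (h0, h1, h2) for a sequence of referenced addresses.

    h0: log2 of the number of distinct words referenced
    h1: entropy of the single-reference frequencies p(i)
    h2: conditional entropy of a reference given its predecessor, p(i|j)
    """
    singles = Counter(trace)
    pairs = Counter(zip(trace, trace[1:]))
    predecessors = Counter(trace[:-1])
    h0 = log2(len(singles))
    h1 = entropy(singles)
    # H(X_t | X_{t-1}) = H(X_{t-1}, X_t) - H(X_{t-1})
    h2 = entropy(pairs) - entropy(predecessors)
    return h0, h1, h2
```

A purely sequential loop over a few words gives h0 and h1 equal to each
other and h2 near zero, which is the pattern observation (i) predicts for
instruction fetching.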

Also included in Table 7 are the low order n-gram entropies
for the instruction reference stream and the data reference stream. Again
the results are as expected, H_2(S) being quite small for instruction
referencing and relatively large for data referencing.
2.7. Summary and Conclusions

Using information theory, the information content, H(S), of a
memory reference stream has been estimated. Also, the memory addressing
overhead, A, was defined. In the last section it was seen that, with
the programs we have traced, A is significantly greater than H(S). The
obvious question then is how can A be reduced, and what costs are incurred
to obtain that reduction?
Granted, building a special purpose computer to run a particular program
would no doubt enable a tremendous reduction in A, but at a great cost.
However,  there  are  techniques  which  can  economically  be  applied  to  various 
aspects  of  computer  system  design  to  reduce  A significantly  for  general 
applications.  The  following  chapters  examine  some  of  these  approaches 
and  their  expected  effectiveness  based  on  the  traces  used. 
CHAPTER 3
MINIMIZATION  OF  A 


3.1.  Introduction 

In Chapter 2 the variables A and H(S) were defined, A being the
addressing overhead incurred by a program and H(S) the uncertainty or
entropy of the memory reference stream for the computation portion of a
program. It was shown that A ≥ H(S). Actual calculations using program
traces demonstrated that Â (the estimate of A) was, in fact, significantly
greater than Ĥ(S) (the estimate of H(S)). This means that there is
considerable room for improvement. The question this chapter examines then
is, can A be reduced and if so, by how much and how easily?

Various methods for improving A are briefly discussed, including
adjusting the displacement and the number of address registers, improving
compilation procedures, and inclusion of architectural features that
improve a CPU's addressing capabilities. Though many of these suggestions
are applicable to almost any architecture, they are mostly aimed at an
R+D architecture like the IBM 360.


3.2.  R+D  Optimization 

In  an  R+D  architecture,  the  primary  contribution  to  A is 
incurred  in  two  ways;  first,  by  the  register  select  and  displacement 
bits  required  by  all  memory  reference  instructions,  and  second,  by  the 
instruction  and  data  bits  required  to  load  and  manipulate  the  address 
(base) registers. There is an obvious trade-off here, since if we use
fewer registers and a smaller maximum displacement the r+d contributions
to A will be smaller. With fewer registers, however, they will have to
be loaded more often, possibly increasing A. In this section an
approximate model of this trade-off is developed.

Definition 3.2.1: A Register Fault occurs when a base register must be
loaded in order to make a memory access.

Definition 3.2.2: The Register Fault Rate, denoted by α, is the
probability that a register fault will occur on the next reference.

A can be approximated then as

    A = \frac{a_{BL} + a_S + a_I + a_R + a_M}{N_c}

where

a_{BL} = (2r + d + 32 + 2) RX_{BL}, the total number of bits due to base
register loads. RX_{BL} is the number of base register loads computed by the
filter program; 32 bits are loaded, 2 opcode bits are used to name the
addressing mode, r+d bits are required to access the operand to be
loaded, and r bits are required to specify the destination register.

a_S = (2 + r + d)(RX_P + RSI_P) + (2 + 2r + 2d) SS_P, the total bits due
to memory reference instructions.

a_I = (12 + r)(RR_{IL} + RR_{IA}) + (16 + r + d)(RX_{IL} + RX_{IA})
+ [one term illegible in the source] + 32 (RX_{IL} + RX_{IA}), the total
bits due to all indexing operations. The 12 and 16 are the number of bits
in these instructions except for the r and d fields.

a_R = the total bits due to register reference instructions.

a_M = (2r + d + 2) RX_S + (r + d + 2) LM + (16 + d) RX_{BA}
+ 16 (RR_{BA} + RR_{BL}) + 32 (3 LM + RX_S + RX_{BA}), the total bits due
to miscellaneous operations such as base register stores.

In the formulas above

r = number of bits for base register selection,
d = number of bits to specify displacement,
N_c = number of computation references,

addressing process statistics:

RR_{IL} = number of loads to index registers (register reference),
RR_{BL} = number of loads to base registers (register reference),
RX_{IL} = number of loads to index registers (memory reference),
RX_{BL} = number of loads to base registers (memory reference),
RR_{IA} = number of index register arithmetic instructions (register
reference),
RR_{BA} = number of base register arithmetic instructions (register
reference),
RX_{IA} = number of index register arithmetic instructions (memory
reference),
RX_{BA} = number of base register arithmetic instructions (memory
reference),
RX_S = number of stores from activated registers,

computation process statistics:

RR_P = number of register reference instructions,
RX_P = number of memory reference instructions,
SS_P = number of storage to storage instructions,
RSI_P = number of register storage and storage immediate instructions, and
LM = number of load multiple register instructions. On the average
3 registers are loaded per instruction.

Now define α̂ (α estimate) as the fraction of references devoted
to base register load fetches,

    \hat{\alpha} = \frac{RX_{BL}}{N_c + RX_{BL}} .

A can then be written as

    A = \frac{a_S + a_I + a_R + a_M}{N_c}
        + \frac{\hat{\alpha}}{1 - \hat{\alpha}} (2r + d + 32 + 2) .
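
The step from the first form of A to the second is not spelled out. A short
derivation, assuming only the definition of the load fraction given above
and that the base register load term is (2r + d + 32 + 2) bits per load
(written here as a_{BL}, a label chosen for this sketch), runs as follows:

```latex
\hat{\alpha} = \frac{RX_{BL}}{N_c + RX_{BL}}
\;\Longrightarrow\;
\hat{\alpha}\, N_c = (1 - \hat{\alpha})\, RX_{BL}
\;\Longrightarrow\;
\frac{RX_{BL}}{N_c} = \frac{\hat{\alpha}}{1 - \hat{\alpha}},
\qquad\text{so}\qquad
\frac{a_{BL}}{N_c}
  = \frac{RX_{BL}}{N_c}\,(2r + d + 32 + 2)
  = \frac{\hat{\alpha}}{1 - \hat{\alpha}}\,(2r + d + 32 + 2).
```

For small load fractions the factor is close to the load fraction itself,
so the load term grows almost linearly with it.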

The number of opcode bits for each addressing operation is
assumed to be 2 bits, since there are basically 4 operations to be
delineated:

- RX, RS, or SI memory reference,

- SS memory reference,

- Base Register Load, and

- No memory reference (RR instruction).

(Note: these opcode bits are not explicit within the 360, but we assume
their existence both here and in the filter algorithm, since they
represent an approximate quantity of information that is required by the
addressing process for its proper operation.)

The above model does appear to be quite complicated. However,
by more accurately characterizing the addressing process,
direct comparisons can be made to the actual trace results. In the
model, all contributions to A remain the same except for their r and d
components and the load rate α. What is now needed is a model of the
α vs. r+d process.

The first major problem we encounter concerns the allocation of
registers within the program. This in itself is a difficult problem, which
will be discussed later in this chapter. A simple model, though, can be
created by making certain assumptions. First, consider the input stream
to an addressing encoder as being the memory references generated by a
Markov process. Let us assume that we know the probabilities,
p(m_i | g_j^{n-1}), of this process, where m_i is a memory location and
g_j^{n-1} is an (n-1)-gram of the n-1 previous locations referenced. These
are called the nth order conditional probabilities, since they are used to
calculate the nth order entropy H_n(S). Assume for this model that no higher
order behavior exists, that is, that H(S) = H_n(S). This assumption results
in an nth order process ((n-1)th order Markov), since the p(m_i | g_j^{n-1})
are independent.

These references then enter a finite-state encoder which contains
2^r base registers of width w and, depending on the contents of the base
registers, outputs a stream of codewords of the set
{1 t_r t_d, 0 t_r t_d t_w}, where

1 t_r t_d represents register select plus displacement, and

0 t_r t_d t_w represents register select plus displacement and register
load (register t_r is loaded with t_w, then displaced from with t_d),

where

t_r is a register select word of r bits,

t_d is a displacement word of d bits, and

t_w is a register load word of w bits.
w 
This stream can then be decoded by a CPU-like mechanism
containing addressing registers, thereby generating the original
reference stream. The communication rate (in bits/computation reference)
using an R+D coding mechanism can now be defined.

Definition 3.2.3: Let R be the coding rate of the above addressing
architecture, i.e.,

    R = (1 + r + d)(1 - \alpha) + (1 + r + d + w)\alpha

or

    R = (1 + r + d) + \alpha w .


R is  a little  more  convenient  to  use  than  A for  assessing  the  efficiency 
of  R+D  coding,  since  it  does  not  contain  other  miscellaneous  operations  that 
are  included  in  A (e.g.  indexing)  and  does  not  include  the  opcode  bits 
required  to  delineate  these  other  operations. 
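
As a quick numerical check of Definition 3.2.3, the two forms of R can be
compared directly. This is only an illustrative sketch; the function name
and the sample parameter values are chosen for the example and do not come
from the tables.

```python
def coding_rate(r, d, w, alpha):
    """Bits per computation reference for R+D coding with fault rate alpha."""
    no_load = 1 + r + d        # flag bit + register select + displacement
    with_load = no_load + w    # the same, plus the w-bit register load word
    return no_load * (1 - alpha) + with_load * alpha
```

With alpha = 0 the rate is simply 1 + r + d, and every additional
percentage point of fault rate costs w/100 bits per reference, which is why
R is so sensitive to the fault rate when w = 32.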

α then represents the proportion of register load insertions
needed by an nth order process. These loads can be allocated by using the
following guidelines (given the particular state of the coder, i.e. the
contents of the set of base registers):

- Only reload a register when a fault, m*, occurs (location m* is
referenced but not within range of any register).

- Reload one register per fault.

- The register to be reloaded, R_j, is selected to be the one with
greatest expected next use time.

- The address loaded into R_j must include m* in its range and should
be chosen to maximize its expected reload time.
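
The encoder and decoder described above, together with the first two
guidelines, can be sketched as a small simulation. Two simplifications are
assumptions of this sketch and not part of the model: the victim register is
chosen by least recent use (a stand-in for "greatest expected next use
time", which requires the process probabilities), and the loaded base
address is aligned to a multiple of 2^d. The tuple layout of the codewords
is likewise only illustrative.

```python
def encode(trace, r, d):
    """Encode addresses as ('1', reg, disp) or ('0', reg, disp, base)."""
    span = 1 << d
    bases = [None] * (1 << r)      # contents of the base registers
    last_used = [-1] * (1 << r)    # time of last use, for the LRU stand-in
    out = []
    for t, m in enumerate(trace):
        hit = next((j for j, b in enumerate(bases)
                    if b is not None and 0 <= m - b < span), None)
        if hit is None:
            # Register fault: reload exactly one register so m is in range.
            hit = min(range(len(bases)), key=lambda j: last_used[j])
            bases[hit] = m - m % span      # aligned base (a simplification)
            out.append(("0", hit, m - bases[hit], bases[hit]))
        else:
            out.append(("1", hit, m - bases[hit]))
        last_used[hit] = t
    return out


def decode(codewords, r):
    """Rebuild the reference stream with a register file of 2**r bases."""
    bases = [None] * (1 << r)
    trace = []
    for cw in codewords:
        if cw[0] == "0":
            bases[cw[1]] = cw[3]           # register load word t_w
        trace.append(bases[cw[1]] + cw[2])
    return trace
```

The fraction of codewords that begin with "0" is the measured fault rate
for the trace, which can be substituted into the expression for R.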

In a first order model, the set of referenced words is allocated
to memory such that m_i < m_j if p(m_i) > p(m_j) (see Figure 8); this is called
an allocation. Since higher order models become rather difficult,
we shall assume a first order process for the model. Furthermore, we
assume for simplicity that the probability distribution is forced
into discrete intervals, each 2^d locations wide. These restrictions
allow a simpler model, but the α obtained will be greater than the actual
α, since less loading freedom is available. Using the register load
guidelines it is easy to see that 2^r - 1 registers should be loaded to cover
the intervals with largest probability of reference and that the last
register should be used to cover one of the "other intervals." The
registers are said to be in state i when the last register covers the
"other interval" associated with p_i. Let P = the probability of a
reference within the range of the 2^r - 1 fixed registers. Then 1-P is the
probability of a reference within the "other intervals," i.e.,

    1 - P = \sum_{i=1}^{k} p_i .

Therefore, the probability of being in state i is p_i / (1 - P), and the
probability of a fault on the next reference, if the register is in
state i, is 1 - P - p_i, so

    \alpha = \sum_{i=1}^{k} \frac{p_i}{1 - P} (1 - P - p_i) .

Since the p_i can be determined from the traces, α and A can be computed
for various values of r+d.
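
The first order interval model can be evaluated mechanically once the word
reference probabilities are known. The sketch below assumes the fault rate
is the sum, over the "other intervals", of the state probability times the
fault probability in that state, as the surrounding text describes; the
function name and input format are hypothetical.

```python
def model_alpha(word_probs, r, d):
    """Fault rate of the first order interval model.

    word_probs: reference probabilities of the referenced words (sum to 1).
    """
    span = 1 << d
    ordered = sorted(word_probs, reverse=True)   # most probable words first
    # Probability mass of each consecutive interval of 2**d locations.
    intervals = [sum(ordered[i:i + span])
                 for i in range(0, len(ordered), span)]
    fixed = (1 << r) - 1                 # registers that are never reloaded
    others = intervals[fixed:]           # the p_i of the "other intervals"
    miss = sum(others)                   # 1 - P
    if miss == 0:
        return 0.0                       # everything is covered: no faults
    return sum(p / miss * (miss - p) for p in others)
```

Sweeping r and d at a constant sum r+d with such a function is one way to
reproduce the kind of curve the model is used for, under the stated
assumptions.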

The model tends to overestimate α for the following reasons:

i) The registers contained only integer multiples of the displacement.
This allowed a much simpler model, but also overestimates α, since
less freedom is available on register loading to cover current
references and future probable references.

ii) In an actual program only the data references and branch fetches
would require the use of the R+D architecture, since sequential
instruction fetching is done by the program counter. Our first
order model modeled all references; therefore, in actual programs, α
would be less than the model's estimate.

iii) It was assumed that S was a first order process (zeroth order Markov)
which, as shown in Chapter 2, is far from true. This leads to more
overestimation, since if a process is of higher order, a policy for
memory and register load allocation can be implemented which uses
this higher order information for further improvement. This is
especially important when well-defined groupings exist, since the
multiple codebook characteristic of the R+D architecture makes
efficient use of such behavior.

There is one other simplification which contributes to the
underestimation of α. It has been assumed that 2^r registers are always
available to the addressing process; however, in an IBM 360 these registers
are shared with the computation process, and there could be fewer registers
available for addressing at times when 2^r are needed, thereby increasing
the fault rate over that required if all 2^r registers were available all
the time. Incorporating the computation process into our model, though,
would have made the analysis much too difficult. This model therefore
assumes a constant 2^r registers always available for addressing.

Figures 9 and 10 show the curves of A versus the sum r+d for
our four traces. The sum r+d was used since there was little variation
of A when r and d were varied within a constant r+d. We began with
r+d = 15 bits for GAUSS, ERROR, and EIGEN, since the maximum number of
address registers used by those programs was 7, giving r = ⌈log_2 7⌉ = 3
and d = 12. LIST, however, used a maximum of 11 registers for addressing,
so that data set began with r+d = 16.

The results are encouraging: r+d can be reduced
significantly before faulting occurs at all, and though the curves rise
steeply to the left (indicating that the basic working space is too
small), the minimum A occurs at 10 or 11 bits (12 for LIST) for the
program traces. This means that with EIGEN, for example, one could use
8 registers and a maximum displacement of 7 bits instead of the original
12, thereby decreasing A from 24 to about 13 bits/computation
reference.

Note that r+d could be reduced further (moving up the curve to
the left) to get approximately the original values of A. With GAUSS, if
r+d = 8, A ≈ 19, which is not much greater than the original A of 17.2.
This could involve, for example, 4 address registers and 6 bits for
displacement. Although there is a significant fault rate at this point,
which slows down the addressing process, the RX instruction size has been
reduced by seven bits. This gives the designer seven more bits to
implement (2^7 times) more complex and helpful instructions, thus possibly
speeding up the computation process. In order to preserve design
flexibility, one must be careful not to go too far to the left on the
design curve. However, as discussed above, the α used to compute these
curves is overestimated. Another reason to avoid going too far to the
left is that by reducing d too much, the directly accessible sections of
memory become more fragmented and difficult to use. Although this is
true, there is an associated advantage, since one is forced to
encode the reference stream more tightly around existing spatial
localities, therefore removing seldom referenced information which
takes up valuable CPU/memory bandwidth and contributes little or nothing
to the CPU's computational efficiency.

The two main conclusions that can be drawn concerning these
results are that the IBM 360 r and d values are too large, at least for
our traces, and that the original compiler generated allocation for the
programs was inefficient. The latter can be seen from the fact that all
calculated values of A are slightly less than the actual A values, since the
model computes α = 0 for all traces at r+d = 15, whereas for the actual
execution of the programs on the 360 α ≠ 0. GAUSS, for example, had
α = .034. This discrepancy is important for two reasons. First,
though .034 seems small, it can be seen from Figures 9 and 10 that A
is quite sensitive to small changes in α. Second, if our simplistic
model, which tends to overestimate α, predicts α = 0, then it should
be easily obtainable by a real system. The reason for α = .034 is then
poor register allocation by the compiler on the 360.

We finally ask, how good a coding mechanism is the R+D
architecture or, in fact, multiple codebook encoding in general?
Table 9 shows the calculation of the estimated rate, R (using the value
for r+d that gives minimal A with w = 32), for each trace; H_0(S) and
H_1(S) are repeated here for convenience. For all the traces, R is
fairly close to H_0(S). This is the chief advantage of R+D addressing,
namely, the fact that one can code more or less independently of w,
i.e. the coding rate depends more on H_0(S) and less on the total address
space 2^w. Though our programs do not code at a rate less than H_0(S),
H_0(S) is not an absolute lower bound for this type of multiple codebook
encoding. To lower the coding rate further, more and more knowledge of
the memory reference process and its past behavior must be used by the
CPU. For standard computer architectures, though, R+D encoding is
reasonably efficient and easy to implement.

R could also be improved by using variable length encoding of
instruction addressing fields, for example by using variable length
register names and placing commonly used addresses in the register with
the shortest name. Displacement could also be encoded with variable
length codes, though either extra bits must be available to give the
length of the displacement field or the set of allowable displacements
must form a uniquely decipherable codeword set. Frequency based
displacement encoding was used on the Burroughs B1700 [13], ostensibly
to reduce the coding rate to H_1(S). This technique, however, would not
be much more efficient than a properly tuned R+D architecture, since for
all our traces the difference between H_1(S) and H_0(S) is usually only
a few bits, and in the B1700 two bits are used just to denote field
length.

It can be concluded from the above results that there are
wasted bits in the IBM 360 architecture. However, with proper design,
an R+D architecture can code fairly efficiently. A technique has been
presented for analyzing such architectures. This architecture is easy
to use and can be embellished further. It was seen that the allocation
of address registers must be done well, since A is sensitive to small
changes in α. In the next section, the problem of better compiler
allocation of address registers is discussed.

3.3. The Address Register Allocation Problem

As was demonstrated in the previous section, if one could
allocate the address registers more nearly optimally, α could, in many
cases, be reduced. Except for the work on index register allocation by
Horowitz et al. [14], we are not aware of any work in this area.

Each variable has a context or scope, which is that portion of
the program where the use of that variable is defined. A variable's
context could arise explicitly from a declaration statement or implicitly
by actual use of the variable, depending on the syntax of the particular
language being used. There are two types of variable allocation
problems: intracontext and intercontext. Intracontext allocation is
concerned with the allocation of address registers (to allow access to
all variables active within the current context) and address register
loads within a context. Intercontext allocation is concerned with the
allocation of address registers and address register loads when
switching from one context to another. There are two classes of
languages, Non-Block Structured Languages and Block Structured Languages,
which differ mainly by their emphasis on either intercontext or
intracontext allocation. In Non-Block Structured Languages, such as FORTRAN,
the context is generally the entire program (sometimes
including subroutines). In this environment, allocation is almost
exclusively intracontext and must be done on a large scale with a large
number of variables. In Block Structured Languages, such as ALGOL and
PASCAL, variables are declared within blocks, which can be fairly small
and can be nested to an arbitrary level. The range of the block then
becomes the context for the variables declared within the block.
Intracontext allocation is still necessary, but if the blocks are small
enough, i.e., if the program is well structured, intercontext allocation
will clearly be a dominant problem.

Section 3.3.1 looks at intracontext allocation and Section 3.3.2
deals with intercontext allocation. Though both types of allocation are
theoretically  similar,  they  are  dealt  with  separately,  since  they  are 
generally  implemented  quite  differently. 

3.3.1. Intracontext Allocation

The intracontext allocation problem has two stages. In the
first, the program is laid out in memory, each program word and each
data word being assigned to a unique location. This process will not
be  discussed  any  further  here,  since  it  is  examined  in  more  detail  in 
Section  4.4.  It  will  be  assumed  then  that  before  the  second  stage  is 
entered  the  program  and  data  are  allocated  to  memory  to  take  advantage  of 
as  much  second  order  information  as  possible;  i.e.,  that  there  is  a large 
degree  of  spatial  locality  (the  possible  successors  to  a referenced  word 
are  spatially  close  to  that  word) . 

In the second stage, the address register load instructions
and all displacements are inserted into the program in such a way as to
reduce the register fault rate α. To be able to allocate optimally, one
would need to know program loop frequencies, which are data dependent.
Of course, it is always possible to restructure a program adaptively each
time it is run, optimizing the allocation for a restricted set of data.
This approach will not be considered here.

Register load instruction placements in the code and register
displacements must be chosen so that any possible data reference or branch
destination is covered by some active address register. This is, of
course, possible if we place a load before every instruction that
references memory. However, we would like to place our load instructions
such that for any possible path through the program, including large
numbers of loops or cycles, α is minimal. Such an objective is
open-ended, since loops could be iterated indefinitely. Also, depending on
the data, it may be that such an objective is inconsistent. This
objective, though, is a good starting point for the development of
heuristic algorithms.

It is possible to estimate the complexity of a brute-force
algorithm to solve the address register load allocation problem. Assume
there are n memory reference instructions in the program. There are then
2^n possible allocations of register load instructions (one before each
memory reference instruction), 2^r possible register allocations for each
node and 2^d possible displacements for each memory reference instruction.
The number of possible solutions to the allocation problem therefore
grows exponentially with n, r and d. This can be quite large, since n is
usually several hundred.
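
To give a feel for the size of this search space, the sketch below counts
candidate allocations under one explicit assumption that is not taken from
the text: each of the n memory reference instructions independently receives
a load/no-load decision (1 bit), a register choice (r bits), and a
displacement (d bits), giving 2^(n(1+r+d)) combinations. The function name
is hypothetical.

```python
def search_space_digits(n, r, d):
    """Number of decimal digits in 2**(n * (1 + r + d)).

    Assumes independent per-instruction choices: load or no load (1 bit),
    a register (r bits), and a displacement (d bits).
    """
    return len(str(2 ** (n * (1 + r + d))))
```

Even for small r and d the count has hundreds of digits once n reaches a
few hundred, which is why exhaustive search is set aside in favor of
heuristics.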
Multiple codebook ideas can be used to develop heuristics for
intracontext allocation. One example is as follows: the memory references
can be spatially clustered heuristically into clusters of size 2^d (the
clustering being done to minimize the probability of leaving a cluster).
The compiler can then traverse the program such that each possible path
is encountered. An active cluster is one which has an address register
pointing to it. There are then 2^r active clusters at any time. A
location fault occurs when a memory reference instruction is encountered
that references a word which is not in an active cluster. At this point
an address register load is inserted by the compiler, activating the
faulted reference's cluster. The register to be loaded, i.e., the cluster
to be deactivated by the loss of the register holding its base address,
is chosen to be one whose next use is estimated to be the furthest away
from the current instruction. This algorithm could be greatly improved
with more accurate estimates of loop and branch frequencies.
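
The heuristic just described can be sketched for a single path through the
program. This is an illustration under simplifying assumptions, not the
compiler algorithm itself: clusters are taken to be aligned blocks of 2^d
words rather than heuristically formed groups, only one straight-line
sequence of references is traversed, and the "estimated" next use is
replaced by the exact next use along that sequence. All names are
hypothetical.

```python
def insert_loads(refs, r, d):
    """Return indices in `refs` before which a register load is needed.

    refs: addresses referenced along one path through the program.
    Clusters are aligned blocks of 2**d words; 2**r clusters may be active.
    """
    clusters = [m >> d for m in refs]
    active = set()
    loads = []
    for t, c in enumerate(clusters):
        if c in active:
            continue
        loads.append(t)                    # location fault: insert a load
        if len(active) == 1 << r:
            later = clusters[t + 1:]

            def next_use(a):
                # Distance to the next reference of cluster a, or infinity.
                return later.index(a) if a in later else float("inf")

            # Deactivate the cluster whose next use is furthest away.
            active.remove(max(active, key=next_use))
        active.add(c)
    return loads
```

The length of the returned list divided by the number of references gives
the fault rate this placement achieves on that path.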

The main point of this section is to define and demonstrate
the complexity of the problem of intracontext address register allocation.
It can be seen that any significant increase in R+D architecture
efficiency over that shown in Section 3.2 would not be easy to effect
by this means.

3.3.2.  Intercontext  Allocation 

Because it is more data dependent, compiling in a Block Structured
Language environment would seem more difficult. This is not entirely
true, since now the primary burden of coding falls onto the user,
providing that the context switching mechanisms of the architecture
operate efficiently. Even here we are basically operating in a multiple
codebook coding environment. When a context is entered, we code within
that context; if a lower level context is then entered we are coding
within both contexts, etc. The user more or less determines how many
variables he will declare at each level, and the context switching rate
(by his choice of block size). The architecture should provide an
efficient coding mechanism both within the context and for context
switching. One can easily see that structured programming techniques
should also lead to improved address encoding, since they stress the use
of highly local modules, thereby reducing the context switching rate.

The 360 is inefficient when executing Block Structured Languages
for two reasons. First, it has more displacement than is necessary for
intracontext addressing. For example, Wichman [15] has shown that most
blocks rarely need more than seven bits to specify all variables.
Secondly, intercontext switching requires loading a base register. In
fact, for nested contexts, all contexts should have active base registers,
leading to a large number of base register loads in a Block Structured
environment. Another example of 360 inefficiency is in array allocation.
Local arrays cannot be allocated until a block is entered, since the
array dimensions need not be supplied until that time. Currently,
most compilers handle this problem by allocating a register each time a
block is entered; this register points to a specially allocated, variable
size area of memory which is used for block operand storage. Usually
the allocated data area does not even contain the arrays themselves,
but rather dope vectors (Gries [16]) which contain pointers to the
actual data area. The array buffer is then allocated dynamically by
special code inserted into the program by the compiler. This technique
requires multiple level indirection, a difficult matter for the 360, which
must reload address registers continually.

Various  approaches  have  been  used  to  make  the  above  operations 
more  efficient.  The  most  common  technique  is  to  use  a form  of  intermediate 
storage.  This  approach  places  another  level  of  memory  between  primary 
memory  and  the  register  set.  It  is  not,  however,  a cache  memory,  since  a 
cache  still  requires  full  addressing.  Intermediate  storage,  through  the 
use  of  a reduced  name  space,  allows  one  to  reference  with  reduced 
addressing  requirements.  Some  schemes  have  had  separate,  intermediate 
stores for operands, vectors, and instructions. Good examples of this
type of architecture are the descriptor based or tagged architectures
(Welch [17]).

All such schemes basically have the following characteristics:
i) a small intermediate storage area between memory and registers,
which reduces the mapping requirements of the instruction set which
operates on this store; and

ii) the existence of either an automatic mechanism which loads this
store on a demand basis (such as the Name Store of the MU5, Ibbett
[18]), or the use of specific instructions (block moves, for example)
to do the job.


A detailed analysis of the above technique is beyond the scope
of this section; however, we can conclude that with this technique it is
possible to address somewhat more efficiently in Block Structured Languages.
There is no reason why such an approach should not be codable with
A ≈ H0(S). For most of our traces this would be an improvement over the
original A. For example, there would be significant improvement for LIST
(H0 = 11.27, A = 24.1), which incidentally is our only Block Structured
Language trace and has the highest A.

The main conclusion one can derive from this and the previous
section is that coding on a standard register plus displacement (R+D)
architecture cannot easily be driven below H0(S). Consequently, due to
limitations in register allocation procedures and in standard R+D
architectures, improvement beyond that shown in Section 3.2 would be
difficult and expensive. This holds true regardless of the class of
language being used. The incorporation of higher order information into
standard R+D architectures should therefore not be considered further.

In  the  next  section  several  general  improvements  to  IBM  360 
addressing  architecture  are  assumed  for  one  of  our  traces.  Results  show 
some approximate gains in addressing efficiency which can easily be
achieved.


3.4.  Architectural  Improvements 

So far in this chapter, two areas of A reduction for more or
less standard R+D architectures have been examined: R+D optimization
and the related area of improved address register allocation.
In this section indexing and other miscellaneous contributions to A are
explored. Also presented are an example of the improvement that can be
easily obtained by the above techniques and some ideas on different ways
of implementing standard addressing architectures.

In some programs, such as GAUSS, indexing overhead contributes
significantly to A. This contribution can be reduced by adding special
indexing capabilities. (Note: Some programs do not have a large number
of index operations, so depending on the job mix, it may not be cost
effective to add indexing instructions and hardware. Also, for some job
mixes the increased indexing efficiency may save fewer bits than the
number of bits added by the new opcodes.)

To demonstrate the amount of overhead that indexing
operations incur, a matrix multiply, which is similar to the kind of
operations used by GAUSS, is analyzed. Table 10 (flowchart in Figure 11)
lists a short IBM 360 assembly language program that multiplies two
matrices, A (n×m) and B (m×t), placing the result in C (n×t). The
computational instructions are starred in Table 10, with the remaining
instructions being indexing or addressing overhead. The program is by no
means optimal, but is intended as an example of the overhead that can be
incurred in vector and matrix operations on the 360.

Since the inner loop occurs n·m·t times, the next loop n·t times,
and the outer loop n times, A can be computed by summing the bits in the
addressing/indexing instructions, their data fetches, and the addressing
portions of the (starred) computation instructions.
Table 10

Matrix Multiplication Program

Register Contents:

R1 - BA, Matrix A's Base Address      R9  - M*T*4
R2 - BB, Matrix B's Base Address      R10 - IA
R3 - BC, Matrix C's Base Address      R11 - IC
R4 - I                                R12 - Partial Result Register
R5 - J                                R13 - Running Sum Register
R6 - Constant "4"                     R14 - Unassigned
R7 - IB                               R15 - Constant "0"
R8 - T*4                              R16 - Program and Data Base

Defined Memory Locations:

(R16) +  0 - Base Address for Matrix A
         4 - Base Address for Matrix B
         8 - Base Address for Matrix C
        12 - N
        16 - T
        20 - M
        24 - "0"
        28 - "4"
        32 - Start of Program

Comment: The 360 has byte (8 bits) addressing, but words are 4 bytes
long, therefore all addresses are in multiples of 4.

Contribution
in bits to A    Code                            Comments

224                   LM    R1, R6, 0(R16)      R1←BA; R2←BB; R3←BC;
                                                R4←N; R5←T; R6←"4"
64                    L     R15, 0(R16)         R15←"0";
16                    LR    R7, R15             IB←0;
16                    LR    R10, R15            IA←0;
16                    LR    R11, R15            IC←0;
16                    LR    R13, R15            R13←0;
16                    LR    R8, R5              R8←T*4;
16                    MR    R8, R6
16                    LR    R9, R8              R9←M*T*4;
64                    M     R9, 20(R16)
18nmt           LOOP: L*    R12                 R12←A[IA]*B[IB];
18nmt                 M*    R12
0                     AR*   R13                 R13←R13+R12;
16nmt                 AR    R10                 IA←IA+4;
32nmt                 BXLE  R7, R8, LOOP        IB←IB+T*4;
                                                IF IB≤M*T*4 THEN GOTO LOOP;
18nt                  ST*   R13, 0(R11,R3)      C[IC]←R13;
0                     LR*   R13, R15            R13←0;
16nt                  AR    R11, R6             IC←IC+4;
32nt                  BCT   R4, ALT             I←I-1;
                                                IF I>0 THEN GOTO ALT;
64n                   L     R4, 16(R16)         I←T;
32n                   BCT   R5, ALT             J←J-1;
                                                IF J>0 THEN GOTO ALT;
32                    BC    15, DONE            Terminate
64nt            ALT:  L     R10, 12(R16)        IA←(N-I)*
16nt                  SR    R10, R4
16nt                  MR    R10, R6
64nt                  M     R10, 20(R16)
64nt                  L     R7, 16(R16)         (T-J)*
16nt                  SR    R7, R5
16nt                  MR    R7, R6
32nt                  BC    15, LOOP
        N_c = 5mnt + 3nt,

        A = (496 + 84nmt + 354nt + 96n) / N_c,  and

        α_γ = 17(mnt + 5nt + 7) + 64(3nt + n + 3) + 3(2nmt + nt)

(α_γ is the indexing overhead which is defined in Section 3.2). Letting
n = m = t = 14, as in GAUSS, we have

        A = 21.2 bits/computation reference,  A_T = 42.6
        (A being 49.7% of A_T),  and  α_γ/N_c = 8.3.

Incidentally, these figures are somewhat close to those for GAUSS, where
A = 17.2 and α_γ/N_c = 5.0.
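
The expressions above are easy to evaluate mechanically. The following is a minimal sketch, not part of the original analysis: the function names are ours, and the coefficients are simply those read from the formulas as printed.

```python
def computation_refs(n, m, t):
    """N_c: 5 computation references per inner-loop pass, 3 per element of C."""
    return 5 * m * n * t + 3 * n * t


def addressing_bits(n, m, t):
    """Numerator of A: addressing/indexing bits summed from Table 10."""
    return 496 + 84 * n * m * t + 354 * n * t + 96 * n


def indexing_bits(n, m, t):
    """Indexing overhead term, with the coefficients as printed in the text."""
    return (17 * (m * n * t + 5 * n * t + 7)
            + 64 * (3 * n * t + n + 3)
            + 3 * (2 * n * m * t + n * t))


n = m = t = 14
n_c = computation_refs(n, m, t)
overhead_A = addressing_bits(n, m, t) / n_c
indexing_ratio = indexing_bits(n, m, t) / n_c
print(f"N_c = {n_c}")
print(f"A = {overhead_A:.2f} bits/computation reference")
print(f"indexing overhead / N_c = {indexing_ratio:.2f}")
```

Evaluated this way, the indexing ratio rounds to the quoted 8.3, while A comes out near 21.1 rather than the quoted 21.2; the small difference presumably reflects rounding in the original figures.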

Most of the overhead computation in the example is devoted to
incrementing and checking subscripts and mapping those subscripts into a
one-dimensional array. An obvious solution to this problem is to provide
a set of two-dimensional matrix operations (with a vector being a matrix
with one dimension being unity). Such operations can be easily
implemented on a microprogrammable machine. Furthermore, since we are
proposing to reduce displacement by at least 3 or 4 bits, the extra
opcodes required can be easily accommodated. Such matrix instructions
would no doubt cover most array operations, since few arrays have more
than two dimensions. (Wichman [19] has discovered that 99.5% of all arrays
in the ALGOL programs he examined had one or two dimensions.) With a
matrix multiply operation, the addressing uncertainty (given the dimensions
and base addresses of the arrays) is close to zero, and so is the addressing
overhead. In this case, then, the CPU is being given specific knowledge of
the addressing characteristics of the program so that it can predict
addressing exactly without further information. Creating instructions
to perform the desired task may seem an extreme solution. However, matrix
and vector operations are not so specialized. Common matrix operations
occur in many different classes of program environments. Of course, high
level languages and compilers would have to be adapted to allow the user
to take advantage of the matrix facilities of the machine.

It is not unreasonable to assume that nearly all of the indexing
overhead in GAUSS could be covered by such special operations. Such
operations then would reduce α_γ/N_c from 5 to nearly zero.

Furthermore, if we reduce such wasteful operations as base
register stores and base register arithmetic, α_β/N_c can also be
forced to nearly zero. In Section 3.2, A was reduced by about 25% (from
17.2 to about 13) by reducing r+d. Gathering together the above
improvements, one can effect a total, prorated decrease in addressing
overhead to approximately

        A = (17.2 - α_β/N_c - α_γ/N_c) * (1 - .25) ≈ 7 bits/computation reference,

which is an improvement of about 58% in addressing efficiency. This is
a significant improvement and was fairly easily obtained even within a
standard architecture. A is thereby brought down to approximately H0(S).
(Actually A is less than H0(S) for GAUSS, because we are able to take
advantage of the higher order structure of the index operations. This,
of course, is not always possible.)



It appears, though, that any improvement beyond this point would
be difficult and costly. In other words, further improvements would be
a matter of marginal returns on further investments, at least when dealing
with additional CPU improvements. In the next chapter a totally different
approach to A reduction is taken by examining new types of memory
architectures.

Before concluding this section, however, a brief comment is in
order concerning the parallelism inherent in the addressing and computation
processes. These processes operate more or less independently of each
other, except when the computation process sends the results of certain
operations to the addressing process (for branch decision making) and
receives instructions and operands from the memory through the addressing
process. Many large systems have tended to split processing along these
lines, though they have not explicitly defined the addressing process as
such. One exception to this is the work of Flynn [20], in which he
explicitly defines three separate machines which execute in tightly
coupled parallelism. These machines, however, do not correspond exactly
to the definitions of computation and addressing used here.

Consequently, though we do not actually reduce A by splitting
up these processes, there is still a great potential for significant
speed gains by taking advantage of such parallelism. For example, one
could define a machine which contains an address processor and a
computation processor. By extension there could exist a pool of computation
processors served by a large number of address processors.

In conclusion, it was shown in this section that there is
room for significant improvement in addressing overhead in the 360
implementation. This improvement can be achieved fairly easily and tends
to increase overall architecture efficiency. Such improvement, however,
is only possible up to a point. Beyond this point, the capabilities of
the compilers and architectures become unreasonably strained for marginal
further gains.

3.5.  Summary  and  Conclusions 

In this chapter the problem of A reduction with standard
architectures was examined. It was demonstrated that in the R+D
architecture of the IBM 360 the number of registers and the displacement
can be reduced significantly. It was also shown that the R+D architecture
is reasonably efficient, especially when executing a program which has
tightly bound groupings of referenced locations. Currently, address
register allocation is done poorly and can be easily improved to a point,
but going beyond that point is an exceedingly difficult problem.

Other improvements are possible in the indexing overhead, which
can be fairly large for some programs. The inclusion of matrix operations
in the CPU instruction repertoire can certainly increase addressing
efficiency in certain program environments, essentially using the opcode
field to give the CPU knowledge of commonly used accessing patterns.

Section 3.3 examined the possibility of further reduction of α
through better register allocation. This section concluded that such
reduction of α was indeed difficult. Hence further reduction of A
by more optimal register allocation should not be considered.

The main conclusion that can be drawn from this chapter is that
it is relatively easy to reduce A significantly until it is reasonably
close to H0(S). Note also that many of our changes result in fewer
fetches. Therefore not only are we reducing the addressing overhead and
the CPU/Memory bandwidth, but also the number of memory accesses.

However, H0(S) is still much larger than H(S) or even H1(S). Therefore,
reducing A further would require major and expensive changes in an R+D
CPU architecture. Rather than concentrate any more on the CPU,
we now focus our attention on the improvement of memory architecture.
The nature of these changes and reasons for them, as well as some
experimental results, are presented in Chapter 4.
CHAPTER 4

SECOND  ORDER  MEMORIES 

4.1.  Introduction 

In Chapter 2 the addressing overhead, A, was defined and then
obtained for various programs. It was shown that A is lower bounded by
the information content of the computation process referencing stream,
H(S), and that with real programs the estimated A is one to two orders
of magnitude greater than the estimated H(S). In Chapter 3 various
methods of reducing A to increase addressing efficiency were examined,
from improved and more finely tuned architecture to compiler design. All
these improvements, however, assumed a standard random access memory
architecture. In this chapter we examine second order memory architectures
that attempt to take advantage of H2(S), which is much smaller than
H0(S) and H1(S), and for our traces was always less than 2.

4.2.  Theory  of  Higher  Order  Memories 

The main responsibility of the addressing process is to map
(decode) the input stream (A bits/computation reference) into the output
stream of memory references (B bits/computation reference). Let us define
some related terms, and then B, more precisely.

Definition 4.2.1: An interaction with memory involves the sending, in
parallel, of a packet of v address bits to the memory. □

Definition 4.2.2: A location fault occurs when more than one interaction
with memory is required to fetch (or store) the desired word for the
computation process. The location fault rate, λ, is defined as

        λ = N_T / N_c,

where N_T is the total number of interactions made, including addressing
overhead, and N_c is the number of computation references made, taken in
the limit as N_c grows large.

Definition 4.2.3: The average number of bits flowing from the CPU to the
memory is

        B = λv,

where v ≥ 1.

Generally v = w = log2(Memory Size); however, a key point in Definition
4.2.3 is that smaller packets, v ≪ w, can be sent, but then normally λ > 1.
As we shall see in this section, certain memory architectures exist which
allow a small v without a corresponding increase in λ. Note that when v = w,

        B = (N_T / N_c) w,

as previously defined.

B tends to be much larger than A. For example, the LIST program
had A = 24.1 and B = (total references/computation references)*32 =
47.3 bits/computation reference. If we could reduce B we could reduce the
mapping requirements. In fact, if we could reduce B to below the presently
observed A we could eliminate much of the CPU's addressing responsibilities
as well as reducing A, thereby increasing addressing efficiency. To
accomplish this we need to take advantage of the higher order behavior
of the referencing stream. This speculation motivates the investigation
of higher order memories.
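
To make the relationship between packet size, fault rate, and B concrete, here is a small illustrative sketch. The function names and the 8-bit packet are our own assumptions; only the LIST figures (A = 24.1, B = 47.3 with 32-bit addresses) come from the text.

```python
def bits_to_memory(fault_rate, v):
    """B: average address bits sent to memory per computation reference."""
    return fault_rate * v


def break_even_fault_rate(target_bits, v):
    """Largest fault rate for which B stays at or below target_bits."""
    return target_bits / v


# LIST trace figures quoted in the text: B = 47.3 with v = w = 32.
list_fault_rate = 47.3 / 32
print(f"implied interactions per reference: {list_fault_rate:.2f}")

# Hypothetical 8-bit packet: how many interactions per reference could be
# tolerated before B exceeds the observed A = 24.1?
print(f"break-even rate at v = 8: {break_even_fault_rate(24.1, 8):.2f}")
```

The point of the second figure is only that a small packet leaves room for several interactions per reference before B exceeds today's A; whether a real memory stays under that rate depends on the program and the allocation.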

Definition 4.2.4: An nth order memory, Mn (n ≥ 1), has the property that
the next location accessed by the memory depends nontrivially on the
previous n-1 locations accessed in addition to the newly arriving
addressing information.

For each Mn one may use any integer length of packet, v. With
v=w, M1 represents a standard random access memory which requires w bits
per interaction. With v=1, the optimal form of M1 is a random access
memory with Huffman encoded addresses. Higher order memories begin to
look desirable because of the following theorem and corollary.

Theorem 4.2.1: With v=1, an nth order memory that retains and uses
information regarding exactly the previous n-1 references in addition to
its normal storage has

        Bn ≥ Hn(S),   (n > 0).

Proof: Assume that memory allocation is fixed, i.e., that a single
codebook is being used. Assume also that no addressing information other
than the previous n-1 references is retained. By Theorem 2.3.2 then we have


1 . > H I 

i,J  1 J 


Since this quantity is just the average number of bits required by the
memory to code reference m_i (sending 1 bit at a time, i.e. v = 1) given
the preceding (n-1)-gram, we have Bn ≥ Hn(S). □

Corollary 4.2.1: For v > 1,

        Bn ≥ Hn(S).

Proof: Since Bn ≥ Hn(S) for v=1, and any other v requires multiples of
one bit, coding with v > 1 could be as efficient as with v=1,
but never better. □

Though coding would tend to be more efficient with smaller packet size, it
is not always more efficient. Of course with v=1, coding is at least as
efficient as with any v>1, but v' > v > 1 does not imply that one can
always code more efficiently with v than with v'. This seeming anomaly is
due to the fact that one must always send an integer number of packets for
each reference, which may lead to wasted address bits when v is poorly
chosen.
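
The packet-size anomaly can be seen in a toy calculation. The sketch below is ours, under the simplifying assumption that each reference has a fixed binary codeword length and is padded up to a whole number of v-bit packets; the eight-successor stream is hypothetical.

```python
import math


def average_bits(code_lengths, probabilities, v):
    """Average address bits per reference when every codeword is sent as
    whole v-bit packets (the last packet is padded if necessary)."""
    return sum(p * v * math.ceil(length / v)
               for length, p in zip(code_lengths, probabilities))


# Hypothetical stream: eight equally likely successors, 3 bits each.
lengths = [3] * 8
probs = [1 / 8] * 8
for v in (1, 2, 3):
    print(f"v = {v}: {average_bits(lengths, probs, v)} bits/reference")
```

Here v = 1 and v = 3 both cost 3 bits per reference, while the intermediate v = 2 costs 4, so the larger of the two packet sizes 2 and 3 is the more efficient one for this stream.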

In our actual 360 programs, the B calculated is greater than the
idealized encoding of the previous theorem since not only are we dealing
with an M1 memory (with v=w), but also we must perform extra fetches
occasionally to keep the addressing process supplied with information to
implement its mapping function. If we eliminate this function then A
becomes approximately equal to B. This approach actually degrades a
standard 360 architecture. However, the 2nd order entropy, H2(S), was
significantly smaller than H0(S) and H1(S), and was, in fact, only 1 or
2 bits/computation reference. Consequently, if we can build an M2, we
have a tremendous opportunity to reduce B and therefore A as well, since
the number of possible second order successors to any reference is much
less than 2^w. Consequently, v could be reduced well below w without much
increase in λ.
4.3. Examples of Second Order Memories

The next questions then are how is an M2 constructed and can
it actually be used efficiently? Recall that second order memories
access the next location based on the current access and some bits
provided by the CPU. Basically then we need to design a simple sequential
machine within the memory. The only work in this area that we are aware
of is that of Sholl [21] and Ouchi [22]. Sholl developed a Direct
Transition Memory, which is a finite-state machine used for microprogram
control. It, however, considered only control flow. Ouchi's Orthogonal
Storage Ring Memory is an efficient technique for using shift registers as
random access memory. Much of the efficiency of the OSR memory arises from
being able to capture some of the first and second order behavior of the
program, but a highly complex layout is required within the memory chips.

In  this  section  four  examples  of  second  order  memories  are 
presented  with  a discussion  of  the  advantages  and  disadvantages  of  each. 

In  succeeding  sections  the  two  most  promising  examples  are  examined  in 
greater  detail. 

First, however, some discussion of the general problems that
occur with M2 memories is needed. We are assuming that memory inputs are
chosen from a set of constant length codewords. This assumption is
necessary from a hardware point of view, but does create some difficulty,
since most memory references made by the computation process have only
one successor, but a few may have several hundred. Another difficulty
is that in most cases, the M2 must be created by using a RAM coupled with
an in-memory mapping mechanism. The specific nature of this mapping
mechanism generally leads to inefficiencies in the way the RAM is used
and also in the mapping process itself. One advantage of these memories,
though, is that of eliminating the mapping hardware within the CPU. Thus,
although some form of index registers would still be needed, the
requirement of having base registers and operations on them is no longer
valid.

For the proper operation of a true M2, the memory must receive
some information regarding the behavior of the program being executed.
This information allows the M2 to perform better than the M1 and usually
consists of the successor set for each location referenced. The obvious
source of this successor information is the compiler or assembler,
perhaps modified by a relocating loader at load time, or with a suitable
linking mechanism in a paging system. For most instructions, since most
are fetched sequentially, only one successor exists. Even for branches
and subroutine calls, sets of possible successors can be derived.
Since most instructions access data from the same location, these too are
not difficult. The successor identification problem becomes sticky, though,
when indirection and indexing come into play. For Non-Block Structured
Languages, the compiler should be able to at least identify the largest
set of possible successors with the help of user specified parameters.
For Block Structured Languages, the problem appears difficult, but not
impossible. Section 4.5 contains a more detailed discussion of this
problem and some possible solutions.

Following are four examples of second order (M2) memories.


4.3.1.  CCD  Memories 


A CCD (Charge Coupled Device) shift register memory is a very
simple M2, since the next location accessed depends on the previous
location accessed. The problem with CCDs is that location faults would
be numerous.

Since H2 is typically greater than 1, the average uncertainty to
be resolved at each reference is greater than two choices. This means
that even with optimal allocation, which is extremely difficult, we would
have a large λ. Furthermore, since CCDs typically have a linear or
multilinear organization, the penalty incurred per location fault may be
very large. Consequently, a one dimensional shift register is not
considered further.

4.3.2. Segmented Counter Accessed Memory (SCAM)

The chief limitation of the CCD memory was that it had only one
dimension. By using a RAM with several counters (incrementing/decrementing
registers) concatenated together and pointing to the desired location,
one can create a pseudo n-dimensional toroidal shift register memory.
The only bits that need be sent to the memory are those required to
decode the next shift direction. The memory then can be visualized as
having memory locations laid out on an n-dimensional toroid in n+1 space
with a read head, residing above the last location referenced, capable of
moving one step in any one of 2n directions (depending on which counter
is selected and the count direction). Figure 12 shows an example of such
a memory. The memory surface appears toroidal, since the memory is
finite and circular in all directions. This design should reduce the
location faulting encountered in the CCD memory. Another advantage is that
the packet size, v = 1 + ⌈log2 n⌉, is independent of the memory size
(though λ may increase with increasing memory size). Good allocation,
however, again becomes important and difficult. Also, because the CPU
has to keep track of the "read head" location and move it accordingly,
paging and loading would be non-trivial. Penalties for location faults
can be quite large if n is kept small.

Figure 12. Segmented Counter Accessed Memory (SCAM).
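
The counter arrangement described above can be modeled in a few lines. This is a toy sketch under our own assumptions (class and method names are hypothetical, and every counter is given the same range); it is meant only to show the packet size and the cost of reaching a non-adjacent location.

```python
class SCAM:
    """Toy model of a Segmented Counter Accessed Memory (names are ours)."""

    def __init__(self, n_counters, counter_size):
        self.n = n_counters
        self.size = counter_size          # positions per dimension
        self.counters = [0] * n_counters  # the "read head" position

    @property
    def packet_bits(self):
        # 1 direction bit plus ceil(log2 n) bits to pick a counter.
        return 1 + (self.n - 1).bit_length()

    def step(self, counter, up=True):
        """One interaction: move the read head one step along one dimension."""
        delta = 1 if up else -1
        self.counters[counter] = (self.counters[counter] + delta) % self.size
        return tuple(self.counters)

    def interactions_to(self, target):
        """Fewest single steps needed to reach target, going either way round."""
        return sum(min((t - c) % self.size, (c - t) % self.size)
                   for c, t in zip(self.counters, target))


mem = SCAM(n_counters=2, counter_size=4)
print(mem.packet_bits)              # bits per interaction
print(mem.step(0, up=False))        # decrementing wraps around the toroid
print(mem.interactions_to((0, 2)))  # steps needed for a non-adjacent target
```

The last call shows why location faults are costly here: any target that is not a neighbor of the read head takes one interaction per step of toroidal distance.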

Thus, although this design does have some advantages, we feel its
disadvantages are sufficient that this type of memory architecture does
not warrant further examination at present.


4.3.3. Segmented Random Access Memory (SRAM)

Although  this  memory  is  not  a true  M2,  it  has  enough  "second 
order"  features  to  merit  its  study. 

It is not necessary to know the successors of a location in
order to allocate an SRAM. Optimal allocation, however, is extremely
difficult and requires such knowledge plus that of the conditional
probabilities, p(m_i|m_j). There are heuristic allocations though, such as
the HI allocation, that do a fairly good job.

Basically an SRAM has an in-memory address register (Figure 13)
pointing to a RAM. The address register consists of k segments of a
bits each. For each interaction v = ⌈log2 k⌉ + a bits are sent. These bits
access one of the segments and replace its contents. Multiple interactions
(at most k) are required whenever multiple segments of the address register
must be changed for the next reference.

Figure 13. Segmented Random Access Memory (SRAM).

This type of memory is a generalization of a type of architecture
used in dynamic 4k and 16k MOS RAMs, where the address is multiplexed
into the memory chip in two halves to save pins, but built-in "page mode"
addressing allows a row (left half of address) to be repeatedly accessed
with only the column (right half of address) being specified for each
reference.
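
The segment-replacement behavior can be sketched as follows. This is an illustrative model only: the class name, the method names, and the choice of segment 0 as the high-order segment are our assumptions.

```python
class SRAM:
    """Toy model of a Segmented Random Access Memory (names are ours)."""

    def __init__(self, k, a):
        self.k = k                # number of address register segments
        self.a = a                # bits per segment
        self.segments = [0] * k   # segment 0 is taken as the high-order one

    @property
    def packet_bits(self):
        # ceil(log2 k) bits to select a segment plus a bits of new contents.
        return (self.k - 1).bit_length() + self.a

    def split(self, address):
        """Break a k*a-bit address into its k segments, high order first."""
        mask = (1 << self.a) - 1
        return [(address >> (self.a * (self.k - 1 - j))) & mask
                for j in range(self.k)]

    def access(self, address):
        """Return the number of interactions needed to reach address."""
        new = self.split(address)
        changed = sum(1 for old, cur in zip(self.segments, new) if old != cur)
        self.segments = new
        # Even an unchanged address costs one interaction.
        return max(1, changed)


mem = SRAM(k=2, a=8)
for addr in (0x0001, 0x0002, 0x0102, 0x0203, 0x0203):
    print(hex(addr), mem.access(addr))
```

In this run only the jump to 0x0203 changes both segments and therefore needs two interactions; the other references, including the repeated one, need a single 9-bit packet each.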

Section 4.4 analyzes this memory in more detail, including results
of an actual simulation of memory operation.

4.3.4.  Successor  Accessed  Memory  (SAM) 

A Successor Accessed Memory is perhaps the most ideal M2 of
all those we have examined. It gives the best performance, but is also
the most difficult to use and requires the most complex address mapping
hardware within the memory.

Basically a SAM has an in-memory RAM (see Figure 14), which
stores data and successor pointers. It should be noted here that large
reductions in storage requirements are possible through more efficient
pointer encoding. The scheme presented in Figure 14 is a straightforward
approach that allows a simple analysis. Pointers to up to 2^v successors
of the last reference are loaded into registers. To access a SAM, v
bits are sent to select one of 2^v successor registers, the next location
is accessed, and the pointers to its successors are placed in the registers.
This register loading could be overlapped with memory/CPU data
communication.
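
A minimal sketch of the access cycle just described is given below. The class, its method names, and the small successor table are hypothetical; the case of more than 2^v successors is simply rejected here rather than handled.

```python
class SAM:
    """Toy model of a Successor Accessed Memory (names are ours)."""

    def __init__(self, v, successors, start):
        self.v = v
        limit = 1 << v  # at most 2**v successor registers
        for location, succ in successors.items():
            if len(succ) > limit:
                raise ValueError(f"location {location} has too many successors")
        self.successors = successors
        self.current = start
        self.registers = list(successors[start])

    def access(self, select):
        """Send v bits selecting a successor register; return the new location."""
        self.current = self.registers[select]
        # Reload the registers with the successors of the new location.
        self.registers = list(self.successors[self.current])
        return self.current


# Hypothetical three-location program: 0 -> 1, 1 -> 2 or 0, 2 -> 0.
table = {0: [1], 1: [2, 0], 2: [0]}
sam = SAM(v=1, successors=table, start=0)
print([sam.access(s) for s in (0, 1, 0, 0)])
```

Each access costs exactly v bits regardless of memory size, which is the attraction of the scheme; the successor table is the information the compiler or loader would have to supply.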
The major disadvantage with the SAM approach occurs when the
number of successors exceeds 2^v. A reasonably fast solution to this
problem and simulation of this memory with our program traces are presented
in Section 4.5.


4.4. Analysis and Simulation of the Segmented Random Access Memory (SRAM)


As mentioned previously, an SRAM consists of a random access
memory with an in-memory address register. This register has k segments of
a bits each and contains the address of the location being accessed in the
memory. Prior to each access one or more of these segments may be changed.
For each such interaction v = ⌈log2 k⌉ + a bits need be sent. For SRAMs then

        B = λ(⌈log2 k⌉ + a).

Consequently, if we have programs that exhibit sequential
behavior and if we can allocate the memory well, we should expect a
fairly small λ for a rather small a. Due to the allocation dependent
properties of this memory it becomes more difficult to analyze. However,
by assuming certain basic accessing patterns the following simple results
can be obtained:

Theorem 4.4.1: For pure serial access of an SRAM, i.e., if location i is
accessed, p(i+1|i) = 1, then

$$\lambda_S = \sum_{i=1}^{k} 2^{(i-k)a}.$$

Proof: If we assume a k segment register with a bits for each segment
and we are accessing successive locations n, n+1, n+2, ..., then out of every
$2^{ka}$ integers we change the high order segment $2^{a}$ times, the second highest
$2^{2a}$ times, and the lowest $2^{ka}$ times, which can be written as

$$\lambda_S = \frac{2^{ka} + 2^{(k-1)a} + \cdots + 2^{2a} + 2^{a}}{2^{ka}} = \sum_{i=1}^{k} 2^{(i-k)a}.$$
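The serial-access result can be checked numerically. The following is a minimal sketch, not part of the original analysis programs: it evaluates the closed form of Theorem 4.4.1 and compares it with a brute-force count of segment changes over one full cycle of $2^{ka}$ successive addresses. The function names are illustrative assumptions.

```python
from math import ceil, log2


def lambda_serial_closed(k: int, a: int) -> float:
    """Closed form of Theorem 4.4.1: sum of 2^((i-k)a) for i = 1..k."""
    return sum(2.0 ** ((i - k) * a) for i in range(1, k + 1))


def segments(addr: int, k: int, a: int) -> list:
    """Split an address into k segments of a bits each (low-order first)."""
    mask = (1 << a) - 1
    return [(addr >> (a * s)) & mask for s in range(k)]


def lambda_serial_counted(k: int, a: int) -> float:
    """Average segment loads per access over one cycle of 2^(ka) serial accesses."""
    size = 1 << (k * a)
    loads = 0
    for n in range(size):
        cur, nxt = segments(n, k, a), segments((n + 1) % size, k, a)
        loads += sum(c != x for c, x in zip(cur, nxt))
    return loads / size


def bits_per_reference(lam: float, k: int, a: int) -> float:
    """B = lambda * (ceil(log2 k) + a), the SRAM bandwidth formula."""
    return lam * (ceil(log2(k)) + a)


if __name__ == "__main__":
    for k, a in [(2, 8), (4, 4), (8, 2)]:
        lam = lambda_serial_closed(k, a)
        print(k, a, round(bits_per_reference(lam, k, a), 3), round(lam, 3))
```

For the three configurations with ka = 16, the rounded values agree with the serial row of Table 11.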


Lemma 4.4.1: Let N(i) be the number of storage locations that are i
interactions (segment loads) away from any current location, then

$$N(0) = 1$$
$$N(1) = k(2^{a}-1)$$

and

$$N(i) = \binom{k}{i}(2^{a}-1)^{i}, \quad 1 < i \le k.$$

Proof: In i interactions, i segments can be changed, each segment allowing
$2^{a}-1$ new locations. Therefore there are $(2^{a}-1)^{i}$ possible addresses.
However, we can change any i of the k segments, or $\binom{k}{i}$ combinations,
resulting in $\binom{k}{i}(2^{a}-1)^{i}$ total possibilities. When i = 1 the one possible
address which requires no change does require at least one interaction,
so

$$N(0) = 1$$
$$N(1) = k(2^{a}-1), \quad i = 1.$$


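Theorem 4.4.2: For random and independent access, i.e.,
$\forall x,y$, $p(x|y) = p(x)$ and $p(x) = \frac{1}{2^{ka}}$, then

$$\lambda_R = \frac{1}{2^{ka}}\left[1 + \sum_{i=1}^{k} i\binom{k}{i}(2^{a}-1)^{i}\right].$$

Proof: At each reference there are N(i) locations i accesses away.
The probability of any member of that set occurring is

$$N(i)\,p(x).$$

Therefore the expected number of loads is $\sum_{i=1}^{k} i\,p(x)\,N(i)$, together with
the one interaction charged for re-referencing the current location, or by using
Lemma 4.4.1 and the fact that $p(x) = \frac{1}{2^{ka}}$, we have

$$\lambda_R = \frac{1}{2^{ka}}\left[1 + \sum_{i=1}^{k} i\binom{k}{i}(2^{a}-1)^{i}\right].$$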
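The random-access case can be evaluated the same way. This is a minimal sketch under two assumptions: that N(i) is as given in Lemma 4.4.1, and that a re-reference of the current location is charged one interaction (the "1 +" term). The function names are illustrative. Because $\sum_i i\binom{k}{i}(2^a-1)^i$ is the mean of a binomial distribution scaled by $2^{ka}$, the result is very nearly $k(1-2^{-a})$.

```python
from math import comb


def n_locations(i: int, k: int, a: int) -> int:
    """N(i) of Lemma 4.4.1: locations exactly i segment loads away."""
    return comb(k, i) * (2 ** a - 1) ** i


def lambda_random(k: int, a: int) -> float:
    """Expected segment loads per access for uniform, independent references.

    Assumption: re-referencing the current location still costs one
    interaction, which contributes the leading 1 in the numerator.
    """
    total_locations = 2 ** (k * a)
    loads = sum(i * n_locations(i, k, a) for i in range(1, k + 1))
    return (1 + loads) / total_locations


if __name__ == "__main__":
    for k, a in [(2, 8), (4, 4), (8, 2)]:
        print(k, a, round(lambda_random(k, a), 3))
```

The rounded values of $\lambda_R$ agree with the random row of Table 11 for all three configurations.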


Table 11 gives the results of the application of Theorems 4.4.1
and 4.4.2 to various memory configurations with ka = 16. These results are
intuitively consistent. Serial accessing does well and, in fact, has a
lower B with more, smaller segments. Table 11 seems to indicate that
sequential accessing of SRAMs generates nonzero overhead ($\lambda_S > 1$).
However, by loading in a zig-zag manner, i.e., changing between
incrementing and decrementing the lower order segments each time a higher
order segment is changed (Gray code), $\lambda_S = 1$.
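The zig-zag loading order can be illustrated with a reflected Gray code in base $2^a$. The sketch below is an assumption about how such an ordering could be generated, not the author's implementation: each segment counts up while the value of the higher-order digits is even and counts down while it is odd, so consecutive locations differ in exactly one segment.

```python
def zigzag_segments(n: int, k: int, a: int) -> list:
    """Map the n-th serial access to k segment values (high-order first).

    A segment is reflected (counted downward) whenever the number formed
    by the higher-order digits of n is odd.
    """
    base = 1 << a
    segs = []
    for s in reversed(range(k)):
        digit = (n >> (a * s)) & (base - 1)
        prefix = n >> (a * (s + 1))  # value of all higher-order digits
        segs.append(digit if prefix % 2 == 0 else base - 1 - digit)
    return segs


def loads_per_step(k: int, a: int) -> set:
    """Set of segment-change counts seen between consecutive locations."""
    addrs = [zigzag_segments(n, k, a) for n in range(1 << (k * a))]
    return {sum(x != y for x, y in zip(p, q)) for p, q in zip(addrs, addrs[1:])}


if __name__ == "__main__":
    print(loads_per_step(2, 2), loads_per_step(4, 4))
```

Every step within one pass through the memory changes a single segment, which is the $\lambda_S = 1$ behavior described above.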

Looking at the data for random accessing, though, it can be
seen that for k' > k, $B_{(k')} > B_{(k)}$. This result is again as expected,
since there is no structure at all in the reference stream. An actual
reference stream fortunately contains considerable structure, so we would
expect an improved B when using this type of memory architecture, which,
as we shall see from the simulation results, is indeed true.

Table 11
SRAM Analysis Results

              k = 2, a = 8       k = 4, a = 4       k = 8, a = 2
              B        λ         B        λ         B        λ
(Serial)      9.035    1.004     6.400    1.067     6.667    1.333
(Random)     17.925    1.992    22.500    3.750    30.000    6.000

It is sufficient to examine only second order information
(1st order Markov process) when simulating the performance of an $M_2$,
since the cost of a transition from state i to state j is dependent on
being in state i and not on how one gets to i. Consequently we did not
use higher order graphs in our simulation studies. Higher order graphs
could, however, provide more information for improved performance of
allocation algorithms. However, due to their complexity such algorithms
were not considered further.

Before actually simulating SRAM operation, the problem of memory
allocation must be investigated. It is important that such a memory have
its contents ordered so as to minimize $\lambda$ by taking advantage of the
second order effects of the reference stream.

The SRAM allocation problem is similar to, but more general
than, the program restructuring problem as examined by Ferrari [23],
since we are trying to improve "spatial locality" by the proper allocation
of memory words.

Before formalizing the allocation problem, the H2 graph is
introduced.

Definition 4.4.1: The H2 graph is a directed graph consisting of a set
of nodes V and a set of directed arcs E. Each node, $v(i) \in V$, is a
word (data or instruction) referenced by a program. Each node is
labeled with the stationary probability of its occurrence, $q_i$, where
$\sum_i q_i = 1$. Each arc, $e_{ij} \in E$, represents a directed path from node i to
node j, indicating that at least once within the reference trace node j
immediately follows node i. All arcs are labeled with the probability
of transition, $p_{ij}$, from node i to node j. Only those arcs are included
with $p_{ij} > 0$, and $\sum_j p_{ij} = 1$.
Let $\Gamma$ be an "allocation," i.e., a mapping of program words
(nodes) v(i) into the linear memory space where $|V| \le 2^{ka}$, and let
$d(\Gamma,i,j)$ be a distance function which is the number of interactions in
an SRAM required under mapping $\Gamma$ to access location j when one has just
accessed location i.

Theorem 4.4.3:

$$\lambda_\Gamma = \sum_{i,j} d(\Gamma,i,j)\,p_{ij}\,q_i .$$

Proof: Given that we are in state i (having just accessed node i), the
expected number of memory interactions (faults) encountered when leaving
i is just

$$\sum_{j} d(\Gamma,i,j)\,p_{ij}.$$

The expected number of interactions at any instance is just

$$\lambda_\Gamma = \sum_{i}\sum_{j} d(\Gamma,i,j)\,p_{ij}\,q_i .$$
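Theorem 4.4.3 translates directly into a computation over the H2 graph. The sketch below is illustrative only: it assumes the graph is held as two dictionaries (`q` for node probabilities, `p` for transition probabilities), that an allocation is a dictionary from node name to SRAM location, and that every access costs at least one interaction. All names are hypothetical.

```python
def distance(x: int, y: int, k: int, a: int) -> int:
    """d(Gamma, i, j): segments that differ between two locations.

    Assumption: an access always costs at least one interaction.
    """
    mask = (1 << a) - 1
    differing = sum(
        ((x >> (a * s)) & mask) != ((y >> (a * s)) & mask) for s in range(k)
    )
    return max(1, differing)


def fault_rate(q: dict, p: dict, alloc: dict, k: int, a: int) -> float:
    """lambda_Gamma = sum over arcs (i, j) of d(Gamma, i, j) * p_ij * q_i."""
    return sum(
        q[i] * p_ij * distance(alloc[i], alloc[j], k, a)
        for i, successors in p.items()
        for j, p_ij in successors.items()
    )


if __name__ == "__main__":
    # Four words referenced in a fixed cycle A -> B -> C -> D -> A.
    q = {n: 0.25 for n in "ABCD"}
    p = {"A": {"B": 1.0}, "B": {"C": 1.0}, "C": {"D": 1.0}, "D": {"A": 1.0}}
    serial = {"A": 0, "B": 1, "C": 2, "D": 3}
    gray = {"A": 0, "B": 1, "C": 3, "D": 2}
    print(fault_rate(q, p, serial, 2, 1), fault_rate(q, p, gray, 2, 1))
```

With k = 2 and a = 1, the serial placement of this cycle gives $\lambda_\Gamma = 1.5$, while the Gray-code placement gives $\lambda_\Gamma = 1$, which shows how strongly the result depends on the allocation.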

The optimal allocation problem then is to find an allocation
$\Gamma^*$ such that

$$\lambda_{\Gamma^*} \le \lambda_\Gamma \quad \forall\,\Gamma.$$

This, however, is a difficult problem, since there are $(2^{H_0(S)})!$ possible
mappings. One can compute a lower bound, $\lambda_{LB}$, by assuming each location
can allocate its successors optimally. In other words, each node i places
its N(1) most probable successors such that it can access them in one
interaction, its next N(2) most probable successors 2 interactions
away, etc. Unfortunately it was discovered that this lower bound is
much too loose to be useful, since for all of our traces $\lambda_{LB} \approx 1$, and
for any mapping $\Gamma$, $\lambda_\Gamma \ge 1$.

Before  describing  the  actual  simulation  program,  each  of  the 
four  allocations  simulated  will  be  discussed. 

Original (ORIG): This is the original allocation of the program and data,
using the addresses provided by the trace. To create a more realistic
model, this allocation was compressed to create a contiguous array of
words, removing the gaps created by filtering out some referenced
instructions and data. Also the lowest numbered word of the allocation was assigned
location 0, the second, location 1, etc.

H1: This is the allocation as defined in Chapter 3. This allocation
was chosen because it is simple and quick, and though requiring
approximate frequency measurements, the allocation does not need knowledge
of the set of successors and their probabilities. This allocation
technique should work quite well if there is only one tight grouping of
frequently occurring nodes. In that case, absolute frequency would also
tend to predict second order behavior. Conversely, if one had several
groupings of words that were equally likely, but had very little
intercommunication, the allocation would not work as well. One might then
use a separate allocation for each group (locality), although shared
instructions and data between groups would then pose a problem. We chose
the single allocation since we feel that most of our traces represented
a small number of such groupings.


Modified H1 (MH1): The modified H1 allocation starts with an H1 allocation
and, using the H2 graph, attempts by use of an iteration algorithm to
improve the allocation iteratively by swapping individual locations. The
main objective is to see if multiple groupings of words exist, and if
significant improvement can be wrung from the original allocation.

The iteration algorithm operates on the modified H2 graph.

Definition 4.4.2: Define a matrix C such that each entry
$c_{ij} = d(\Gamma,i,j)\,q_i\,p_{ij}$ for each directed arc in the graph. The modified H2
graph is an undirected graph such that the arcs are those entries in matrix
C' where

$$C' = C + C^{T}.$$

Theorem 4.4.4: Using the matrix C' defined above,

$$\lambda_\Gamma = \frac{1}{2}\sum_{i,j} c'_{ij}$$

where $c'_{ij} \in C'$.

Proof: From Theorem 4.4.3 we have

$$\lambda_\Gamma = \sum_{i,j} d(\Gamma,i,j)\,q_i\,p_{ij}.$$

The probability of arc (i,j) occurring is just $q_i p_{ij}$, and also
$d(\Gamma,i,j) = d(\Gamma,j,i)$. Removing the directedness of the arc by matrix
addition, all arcs are counted twice; self loops are also doubled.
Therefore

$$\frac{1}{2}\sum_{i,j} d(\Gamma,i,j)\,(q_i p_{ij} + q_j p_{ji}) = \frac{1}{2}\sum_{i,j} c'_{ij} = \lambda_\Gamma .$$


Beginning with an allocation $\Gamma$, the iteration algorithm modifies
$\Gamma$ successively by looking at node pairs, $i_1$ and $i_2$, and swapping their
allocation if the following is true:

$$\sum_{j} c'_{i_1 j} + \sum_{j} c'_{i_2 j} \;>\; \sum_{j} \frac{d(\Gamma,i_2,j)}{d(\Gamma,i_1,j)}\,c'_{i_1 j} + \sum_{j} \frac{d(\Gamma,i_1,j)}{d(\Gamma,i_2,j)}\,c'_{i_2 j}.$$

In  actual  operation  not  all  pairs  are  examined,  just  the  most  frequently 
accessed,  since  generally  a small  group  of  words  were  accessed  more 
frequently  than  the  rest.  It  should  be  mentioned  that  by  using  the 
modified  H2  graph  and  considering  the  fact  that  most  words  have  a small 
number  of  successors,  the  iteration  algorithm  works  very  quickly. 
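The pairwise-swap idea can be sketched as follows. This is a simplified illustration, not the iteration algorithm as implemented: instead of the incremental test on the modified H2 graph, it recomputes $\lambda_\Gamma$ after each trial swap and keeps the swap only if $\lambda_\Gamma$ drops. The data layout and function names are assumptions, and every access is assumed to cost at least one interaction.

```python
from itertools import combinations


def distance(x: int, y: int, k: int, a: int) -> int:
    """Segments that differ between two locations (at least one interaction)."""
    mask = (1 << a) - 1
    return max(1, sum(((x >> (a * s)) & mask) != ((y >> (a * s)) & mask)
                      for s in range(k)))


def fault_rate(q: dict, p: dict, alloc: dict, k: int, a: int) -> float:
    """lambda_Gamma from Theorem 4.4.3."""
    return sum(q[i] * p_ij * distance(alloc[i], alloc[j], k, a)
               for i, succ in p.items() for j, p_ij in succ.items())


def improve_by_swaps(q, p, alloc, k, a, max_passes=10):
    """Swap the locations of node pairs while doing so lowers lambda_Gamma."""
    alloc = dict(alloc)
    best = fault_rate(q, p, alloc, k, a)
    for _ in range(max_passes):
        improved = False
        for i1, i2 in combinations(sorted(alloc), 2):
            alloc[i1], alloc[i2] = alloc[i2], alloc[i1]      # trial swap
            trial = fault_rate(q, p, alloc, k, a)
            if trial < best - 1e-12:
                best, improved = trial, True                 # keep the swap
            else:
                alloc[i1], alloc[i2] = alloc[i2], alloc[i1]  # undo it
        if not improved:
            break
    return alloc, best


if __name__ == "__main__":
    q = {n: 0.25 for n in "ABCD"}
    p = {"A": {"B": 1.0}, "B": {"C": 1.0}, "C": {"D": 1.0}, "D": {"A": 1.0}}
    print(improve_by_swaps(q, p, {"A": 0, "B": 1, "C": 2, "D": 3}, 2, 1))
```

Restricting the candidate pairs to the most frequently accessed words, as described above, would only require changing the set of pairs passed to the inner loop.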

H2 allocation (H2): To achieve an allocation with performance close to
$\lambda_{\Gamma^*}$, one must use as much second order information as possible. The H2
allocation algorithm used here is a heuristic which attempts to group most
probable successors closely together. A heuristic was chosen over an
optimal algorithm because of the difficulty involved in constructing such
an optimal algorithm. The heuristic allocates fairly well and does so
with a reasonable complexity, $O(n_v \cdot n_e \cdot \log n_v)$, where $n_v = |V|$ and $n_e = |E|$.
Of course it does need the graph with its associated probabilities.

Optimal allocation is essentially a k level clustering problem
(all clusters at a level being of equal size). Most algorithms presented
for clustering graphs are strictly for k = 2 (Ferrari [24]). Optimal
allocation is a difficult problem for large values of k, and would
probably not be a cost effective procedure in a general purpose computing
environment. Hence we have chosen a reasonably quick heuristic instead.

Assume that each node in the H2 graph is labeled with its
steady-state probability, $q_i$, and that each arc is labeled with the arc
probability, $w_{ij} = q_i p_{ij}$. Further assume that arcs and nodes are indexed
so that $w_{ij} \ge w_{i(j+1)}\ \forall j$ and $q_i \ge q_{i+1}\ \forall i$.

H2 algorithm: (To make this discussion simpler, it is assumed here that
the length of the original list of nodes to be allocated, $n_v$, is an integer
power of two. Of course the algorithm as implemented actually operates on
any integer $n_v$.) The algorithm basically takes the original list of nodes
and splits it in two. It then splits each half recursively until all lists
are of length one. When a split is made, the first half of the list has a
"0" appended to an address string that exists for each node (word). The
second half of the list has a "1" appended to each node's string. When all
lists are of length one, each address string will contain a binary number
that is that word's address. Essentially we are allocating for an SRAM
with $k = \lceil \log_2 n_v \rceil$ and a = 1 under the assumption that such an allocation
would also be valid for any other value of k and a (such that $ka \ge H_0(S)$).
The lists are actually split as follows:

Step 1: Take the first available node of the original list, the most
        probable, remove it from the original list, and place it on
        the split list. Copies of its successor pointers are then
        placed in an ordered list of successors, most probable first.

Step 2: Take the first element on the successor list, the most
        probable arc, and attach the node pointed to to the split
        list (removing that node from the original list) and merge
        copies of its successor pointers onto the successor list,
        such that the successor list remains an ordered list
        (most probable first) and that only those pointers are
        included that point to nodes remaining in the original list.
        If no successors are left on the successor list, start a new
        node as in Step 1 and go to Step 2.

Step 3: If the length of the split list is less than the length of
        the updated original list then go to Step 2, else terminate
        the split process.
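The splitting steps above can be sketched in a few lines. This is a simplified reading of the H2 heuristic, not the procedure added to H012: the graph is assumed to be held in dictionaries (`q` for node probabilities, `w` for arc probabilities), each sublist is re-sorted by node probability before it is split again, and address strings of unequal length are padded with zeros. All names are hypothetical.

```python
def split(nodes: list, w: dict):
    """Steps 1-3: move nodes to the split list until it is at least as long
    as what remains of the original list."""
    remaining, chosen, successors = list(nodes), [], []
    while len(chosen) < len(remaining):                    # Step 3 test
        # keep only pointers to nodes still in the original list
        successors = [(wt, t) for wt, t in successors if t in remaining]
        if successors:
            _, node = successors.pop(0)                    # Step 2
        else:
            node = remaining[0]                            # Step 1
        remaining.remove(node)
        chosen.append(node)
        successors.extend((wt, t) for t, wt in w.get(node, {}).items()
                          if t in remaining)
        successors.sort(key=lambda entry: -entry[0])       # most probable first
    return chosen, remaining


def h2_allocate(q: dict, w: dict) -> dict:
    """Recursively split the node list, appending '0' or '1' to each address."""
    address = {n: "" for n in q}

    def by_probability(nodes):
        return sorted(nodes, key=lambda n: -q[n])

    def recurse(nodes):
        if len(nodes) <= 1:
            return
        first, second = split(nodes, w)
        for n in first:
            address[n] += "0"
        for n in second:
            address[n] += "1"
        recurse(by_probability(first))
        recurse(by_probability(second))

    recurse(by_probability(list(q)))
    width = max(len(bits) for bits in address.values())
    return {n: int(bits.ljust(width, "0") or "0", 2) for n, bits in address.items()}


if __name__ == "__main__":
    q = {"A": 0.30, "C": 0.28, "B": 0.22, "D": 0.20}
    w = {"A": {"B": 0.30}, "B": {"A": 0.22}, "C": {"D": 0.28}, "D": {"C": 0.20}}
    print(h2_allocate(q, w))
```

In this small example A and B end up sharing the high-order address bit even though C is more frequent than B, because B is A's most probable successor.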

The  above  algorithm  does  have  two  weaknesses.  First,  it  does  not  combine 
arcs  which  emanate  from  the  split  list  and  point  to  a common  node.  This 
would  have  required  extensive  searching  of  the  successor  list  for  each 
node  split.  Therefore  it  was  not  implemented.  The  second  problem 
concerns  the  manner  in  which  the  split  list  is  grown.  The  technique  used 
by  the  algorithm  will  certainly  capture  a single  tightly  bound  grouping  of 
nodes.  It  does  not,  though,  appear  to  handle  other  lesser  groupings 
adequately.  This  again  was  done  for  the  sake  of  lower  complexity.  A 
better algorithm would probably try to grow from multiple seeds
simultaneously.

To  perform  the  allocations  mentioned  above,  a procedure  was 
added  to  the  H012  program,  since  H012  uses  an  internal  representation  of 
the  H2  graph.  The  various  allocations  are  placed  in  a file,  to  be  read  by 
the  SRAM  simulator.  This  file  contains  the  actual  address  or  name  of  each 
word,  as  referenced  by  the  trace,  and  a location  in  the  SRAM  given  to  it 
by  each  of  the  four  allocation  routines.  These  allocations  are  read  into 
an  associative  table  within  the  simulator,  using  the  original  address  as 
key. The simulator then reads the computation reference trace, keeps a
register for each allocation, indicating the last location referenced, and
calculates $\lambda$ as the average number of interactions required to access the
next reference. The simulator was run for each set of k and a.

In addition to computing $\lambda$ for each allocation, the simulator
calculated the $\lambda$ required by the SRAM if 1, 2, or 4 separate address
registers existed in the memory (for each allocation). With more registers
in the memory, the CPU can have registers pointing to various competing
word groupings. For all configurations then

$$B = \lambda\,[\,r + a + \lceil \log_2 k \rceil\,]$$

bits are sent each time, where r = 0, 1, 2 (register select bits). For
multiple register SRAMs the simulator always updates the register that
requires the least number of interactions to change it to the new address.
In case of a tie, the choice is just the first minimum change register
encountered.
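The register-selection rule can be sketched as a small trace-driven loop. This is an illustrative reconstruction, not the original simulator: it assumes all address registers start at location 0, that every reference costs at least one interaction, and that the allocation is a dictionary from trace word to SRAM location. The names are hypothetical.

```python
from math import ceil, log2


def distance(x: int, y: int, k: int, a: int) -> int:
    """Segments that differ between two locations (at least one interaction)."""
    mask = (1 << a) - 1
    return max(1, sum(((x >> (a * s)) & mask) != ((y >> (a * s)) & mask)
                      for s in range(k)))


def simulate_sram(trace, alloc: dict, k: int, a: int, registers: int = 1):
    """Return (lambda, B) for a reference trace under a given allocation."""
    regs = [0] * registers              # assumption: registers start at 0
    loads = 0
    for word in trace:
        target = alloc[word]
        costs = [distance(r, target, k, a) for r in regs]
        best = costs.index(min(costs))  # first minimum wins a tie
        loads += costs[best]
        regs[best] = target
    lam = loads / len(trace)
    r_bits = ceil(log2(registers))      # 0, 1, or 2 register select bits
    return lam, lam * (r_bits + a + ceil(log2(k)))


if __name__ == "__main__":
    alloc = {"X": 0b0000, "Y": 0b1111}
    trace = ["X", "Y"] * 10
    print(simulate_sram(trace, alloc, 2, 2, registers=1))
    print(simulate_sram(trace, alloc, 2, 2, registers=2))
```

In this toy trace two competing locations alternate; a second register lets each one keep its own pointer, so $\lambda$ falls enough to outweigh the extra register select bit.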

Table 12 shows SRAM simulation results for GAUSS and LIST. The
bits/interaction column is just $v = \lceil \log_2 k \rceil + a + r$. For the H1 allocation,
the trace was split into an instruction and a data reference stream. These
streams were also run through the simulator, giving performance equivalent
to that of placing instructions and data in separate memories. The
iteration algorithm was only run for an SRAM with one register.

The following conclusions can be derived from the simulation
results:

i)  For most configurations, B was generally close to the lower bound,
$v = r + a + \lceil \log_2 k \rceil$, and for all configurations B was much lower than
the original value calculated for the 360 architecture.
Table 12
SRAM Simulation Results

GAUSS:
              # of        Bits/
              Registers   Interaction   MH1           H2     ORIG   H1     H1 Inst.   H1 Data
k=2, a=6      1           7             (illegible)   10.3   11.9   10.5   7.2        10.0
              2           8             -             10.8   10.4   9.9    8.2        10.2
              4           9             -             10.6   10.1   9.8    9.1        10.2
k=4, a=3      1           5             9.2           10.6   14.1   10.5   6.1        9.6
              2           6             -             10.8   10.3   9.3    6.9        10.0
              4           7             -             10.5   9.9    9.6    8.2        10.?

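LIST: (k=2, a=7 and k=4, a=4; 1, 2, and 4 registers each; the column
alignment of this half of the table is not recoverable, so the surviving
values are listed under the headings they follow)

Bits/Interaction:   8   9   10
MH1:                9.6   10.8
                    9.5   10.2   10.8
                    9.5   10.3   10.9
ORIG:               12.9   11.9   12.0
                    13.1   11.6   11.5
H1:                 11.5   11.5   11.6
H1 Inst.:           11.9
                    8.0   8.9   9.6
                    9.2
H1 Data:            (no values legible)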
B (bits/computation reference)

           B for Original Trace   H0(S)    H1(S)   H2(S)
GAUSS      47.3                   11.27    9.58    (not legible)
LIST       39.9                   10.36    7.88    (not legible)


since with more registers, pointers to commonly occurring
tightly bounded groups of words can be efficiently kept, thereby
lowering the fault rate.

vi)  It was found with the most highly segmented memory, a = 1, that the
fault rate, $\lambda$, and B increased drastically. Therefore this case was
not considered further.

In conclusion, the SRAM is difficult to allocate optimally,
though with simple heuristics like our H2 allocation algorithm, $\lambda$ is
close to one. The main problem with the SRAM, however, is that v itself
is fairly large. Since v alone is greater than $H_2(S)$, it would be
senseless to try to incorporate more second order information into the
allocation as an optimal allocation would do. Basically then, the mapping
mechanism itself requires too much overhead to allow the SRAM to approach
$H_2(S)$.

Before  concluding  this  section  a brief  word  is  necessary 
concerning  indexing  by  a CPU  into  a SRAM.  The  problem  occurs  when  accessing 
a complete  array  requires  several  segment  changes.  Consequently,  when  an 
index  instruction  is  executed,  the  number  of  interactions  with  memory  will 
depend  on  the  value  in  an  index  register.  This  problem,  however,  can  be 
solved  fairly  easily  by  adapting  the  CPU  so  that,  depending  on  the  value 
in  the  index  register,  it  takes  the  appropriate  action  to  move  the  SRAM  from 
the  current  state  to  the  desired  state  independently  of  the  instruction 
stream. This same approach can be used for multi-level indirection.

In conclusion, SRAMs yield a marked improvement over an $M_1$ with
v = w. However, SRAMs do not approach the ideal performance for an $M_2$.
Actually their required bandwidth is fairly close to that of a Huffman
encoded $M_1$ with v = 1. This capability should not be understated since
it yields a significant improvement over the original value of B.
Consequently, we feel this memory could be applicable to a general
purpose computing environment. It increases addressing efficiency
(e.g. no base registers need be maintained) and it is relatively easy
to use.

In the next section we examine a memory which is almost an ideal
$M_2$ and therefore has a much smaller addressing overhead.


4.5.  Analysis  of  the  Successor  Accessed  Memory  (SAM) 

The SAM is a random access memory with a specialized mapping
mechanism. Essentially, the CPU addressing mechanism has been completely
moved into the memory. In addition, we give this mechanism program-specific
knowledge of second order reference stream behavior. This knowledge has
the distinct advantage of not only allowing reduction of the address
computation requirements of the CPU, but also of the communication
requirements between the CPU and memory. The mechanism discussed here then is
just a particular kind of mapping mechanism which takes advantage of the second
order efficiency discussed earlier.

A very important difference between this memory and the SRAM is
that the SAM does not have a global allocation problem. However, there is
a coding problem associated with each reference's successors. If we have v
address bits per transaction, we choose from $2^v$ possible successors for
each reference. Since most references have at most one or two successors,
we would not want v to be too large. The problem is, what do we do when
the number of successors for a particular reference is greater than $2^v$?

In this case, some indirection method must be used by the mapping
mechanism. This can be accomplished by having a bit in each data word,
which when set indicates that the word contains no data and is only a
dummy word providing another level of indirection to enable a greater
number of successors to be accessed (see Figure 15). This technique does
require the CPU to be aware of the path it must take to access the desired
successor. Possible solutions to this and the indexing problem will be
discussed in more detail at the end of this section.

Since various successors to a particular reference may have
different probabilities of occurrence, the best way to allocate indirect
pointers to minimize the expected number of levels is to use Huffman
encoding. In fact from coding theory we have the following theorem:

Theorem 4.5.1: A Successor Accessed Memory can be allocated (using
indirect multi-level addressing) such that

$$H_2(S) \le B_2 < H_2(S) + v$$

where $2^v$ is the number of immediate successors available for each
referenced word, and v is the number of address pins used by the memory.

Proof: From Gallager [25] we have

$$\frac{H(S)}{\log_2 D} \le \bar{n} < \frac{H(S)}{\log_2 D} + 1 \qquad (1)$$

where $\bar{n}$ is a coding rate and D is the alphabet size being used. Now let

$$B_2 = \sum_{j} p(m_j)\,B(m_j)$$

where

$$B(m_j) = \sum_{i} p(m_i|m_j)\,n_{i|j},$$

$n_{i|j}$ being the length of the code word (in bits) needed to code the
reference to location $m_i$ given that $m_j$ has just been accessed, and

$$H_2(S) = \sum_{j} p(m_j)\,H(S|m_j)$$

where $H(S|m_j) = -\sum_{i} p(m_i|m_j)\log_2 p(m_i|m_j)$.

Our actual code alphabet size is $D = 2^v$, i.e., we can choose from $2^v$
successors with each reference, so by applying (1)

$$\frac{H(S|m_j)}{v} \le \bar{n}(m_j) < \frac{H(S|m_j)}{v} + 1$$

where $\bar{n}(m_j) = \sum_{i} p(m_i|m_j)\,n'_{i|j}$, $n'_{i|j}$ being the length of the code (in words
of v bits each) required to code $m_i$ given that $m_j$ has just been referenced.
Since $B(m_j) = v\,\bar{n}(m_j)$ we have (multiplying by v)

$$H(S|m_j) \le B(m_j) < H(S|m_j) + v.$$

Multiplying all inequalities by $p(m_j)$ and summing over j we have

$$\sum_{j} p(m_j)H(S|m_j) \le \sum_{j} p(m_j)B(m_j) < \Big(\sum_{j} p(m_j)H(S|m_j)\Big) + v\sum_{j} p(m_j)$$

or

$$H_2(S) \le B_2 < H_2(S) + v.$$
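The bound can be checked for a single reference's successor distribution with a $2^v$-ary Huffman code. The sketch below is illustrative and not taken from the original programs: it uses the standard construction (padding with zero-probability dummy successors so that every merge combines exactly $D = 2^v$ entries), and the function names are assumptions.

```python
import heapq
from math import log2


def huffman_lengths(probs: list, D: int) -> list:
    """Code-word lengths, in D-ary symbols, of a D-ary Huffman code."""
    n = len(probs)
    pad = (-(n - 1)) % (D - 1)          # dummies so each merge takes exactly D
    heap = [(p, idx, [idx]) for idx, p in enumerate(probs)]
    heap += [(0.0, n + t, []) for t in range(pad)]
    heapq.heapify(heap)
    lengths = [0] * n
    counter = n + pad                   # unique tie-breaker for merged entries
    while len(heap) > 1:
        total, members = 0.0, []
        for _ in range(D):
            p, _, group = heapq.heappop(heap)
            total += p
            members += group
        for idx in members:
            lengths[idx] += 1           # one more level of indirection
        heapq.heappush(heap, (total, counter, members))
        counter += 1
    return lengths


def entropy(probs: list) -> float:
    """H(S | m_j) in bits for one successor distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def sam_bits(probs: list, v: int) -> float:
    """B(m_j) = v * (expected code length in words of v bits)."""
    lengths = huffman_lengths(probs, 2 ** v)
    return v * sum(p * n for p, n in zip(probs, lengths))


if __name__ == "__main__":
    probs = [0.5, 0.25, 0.125, 0.125]
    for v in (1, 2, 3):
        print(v, entropy(probs), sam_bits(probs, v))
```

For this distribution the bound is met with equality at v = 1, and B grows with v, which is consistent with the observation that smaller v gives the smaller B.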

The above theorem tells us two things: first, that we can bound
$B_2$ from below and above, and second, that the smaller v we use the better
(though with smaller v the SAM would become more program structure
sensitive). This scheme, of course, assumes complete knowledge of the
conditional probabilities $p(m_i|m_j)$. What we would like is a heuristic
algorithm which only has knowledge of successors and possibly some
partial knowledge of probability distributions. Such a heuristic does,
in fact, exist and still gives smallest B with v = 1.


The heuristic we used is as follows. If all successors are
equally likely, which is often true for instructions fetching data from
an indexed array, then the optimal allocation is a balanced tree (see
Figure 16a). Now if one location dominates all the rest (each next
probability is greater than the sum of all those remaining), then a sequential
allocation is optimal (see Figure 16b). Our heuristic uses only sequential
or balanced tree allocation. The only decisions it must make are when to
use sequential and when balanced tree allocation, and if using sequential
allocation, in what order the successors should be allocated. For our
analysis the estimated probabilities $p(m_i|m_j)$ are known; the decisions
being made were therefore based on those $p(m_i|m_j)$. Furthermore, it is
not unreasonable to expect a compiler to be able to estimate whether
sequential or balanced tree allocation should be used for a particular
reference based on the type of statements being compiled at that instant.
For example, array references would almost always be allocated as a
balanced tree, instruction references sequentially. It should be noted
that our results are slightly better than would be expected in an actual
environment as a consequence of the above assumption of knowledge of the
$p(m_i|m_j)$. Of course if a program were to be run many times, post factum
optimization could be used very effectively.

Our analysis is of the H2 graph (as defined in Section 4.4)
rather than an actual simulation using the program traces. As with the
SRAM, it is not necessary to have greater than second order knowledge
of the process to simulate an $M_2$. With our analysis we find an estimate of $\lambda$,
as defined in Definition 4.2.2, and then B, where

$$B = v\,\lambda.$$

Figure 16.  SAM heuristic allocation (v = 1): (a) Balanced Tree Allocation;
(b) Sequential Allocation. Legend: filled circle, Indirect Pointer with No
Data; open circle, Data with Successor Pointer.

For balanced tree allocation, the expected number of levels (accesses)
required to fetch reference i's successor is

$$\lambda_{TREE}(i) = \log_{2^v} S(i) \qquad (1)$$

where S(i) is the number of successors to location i. The expected number
of levels required to access i's successor with sequential allocation is

$$\lambda_{SEQ}(i) = \sum_{\ell=1}^{\lceil S(i)/(2^v-1) \rceil} \ell \sum_{j=(\ell-1)(2^v-1)+1}^{\ell(2^v-1)} p_{ij} \qquad (2)$$

where the $p_{ij}$ are indexed such that $p_{ij} \ge p_{i(j+1)}$ for all j. Equation (2)
is derived from the fact that $(2^v-1)$ data elements are allocated at each
level with $\lceil S(i)/(2^v-1) \rceil$ levels required.

The expected number of levels averaged over the stationary
probability, $q_i$, of each reference is

$$\lambda_{TREE} = \sum_{i} q_i\,\lambda_{TREE}(i) \qquad (3)$$

$$\lambda_{SEQ} = \sum_{i} q_i\,\lambda_{SEQ}(i) \qquad (4)$$

$$\lambda_{BOTH} = \sum_{i} q_i \min\big(\lambda_{TREE}(i),\ \lambda_{SEQ}(i)\big). \qquad (5)$$

The $\lambda_{SEQ}(i)$ calculation is inaccurate in one minor respect: if
$S(i) = k(2^v-1)+1$ for some k, formula (2) will allocate one level greater
than is necessary, since it will place the last element on a new level
with a pointer from the previous level pointing to it. This approach
overestimates $\lambda_{SEQ}(i)$ slightly, but does make the computations easier.

The above calculations were actually done by a procedure added
to the H012 program, since H012 has an internal representation of the
graph.
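The level calculations can be sketched as below. This is a hedged illustration rather than the H012 procedure: the sequential case follows formula (2) directly, while the balanced-tree case is an assumed reading of formula (1) that uses the whole-number depth of a tree with fan-out $2^v$ and charges at least one access. The dictionary layout and function names are hypothetical.

```python
def levels_tree(num_successors: int, v: int) -> int:
    """Assumed reading of (1): depth of a balanced tree with fan-out 2^v,
    rounded up to whole levels and never less than one access."""
    depth, capacity = 0, 1
    while capacity < num_successors:
        capacity *= 2 ** v
        depth += 1
    return max(1, depth)


def levels_seq(probs: list, v: int) -> float:
    """Formula (2): 2^v - 1 data words per level, most probable first."""
    per_level = 2 ** v - 1
    ordered = sorted(probs, reverse=True)
    return sum((j // per_level + 1) * p for j, p in enumerate(ordered))


def sam_rates(q: dict, p: dict, v: int):
    """Formulas (3)-(5): lambda_TREE, lambda_SEQ and lambda_BOTH."""
    tree = seq = both = 0.0
    for i, successors in p.items():
        probs = list(successors.values())
        t, s = levels_tree(len(probs), v), levels_seq(probs, v)
        tree += q[i] * t
        seq += q[i] * s
        both += q[i] * min(t, s)        # pick the better scheme per reference
    return tree, seq, both


if __name__ == "__main__":
    q = {"loop": 0.5, "array": 0.5}
    p = {
        "loop": {"x": 0.90, "y": 0.05, "z": 0.03, "w": 0.02},   # one dominant successor
        "array": {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25},  # equally likely
    }
    print(sam_rates(q, p, 1))
```

In this example the dominated distribution is served better sequentially and the uniform one better by the tree, so the per-reference minimum is lower than either scheme applied everywhere.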

The results of these calculations for GAUSS and LIST (the worst
and best programs) are given in Table 13 for $1 \le v \le 3$. These results are
quite interesting. Over all cases $\lambda_{BOTH}$ is much less than either $\lambda_{TREE}$ or
$\lambda_{SEQ}$; this indicates that generally there is no predominance of one type
of distribution over another. Thus the use of a selection mechanism is
justified. Furthermore, even with this heuristic allocation technique,
the optimal B is with v = 1, and for all traces $B < H_2(S) + 1$ with v = 1.

Of course $\lambda_{BOTH}$ decreases monotonically with increasing v. This
means that although addressing efficiency is greater with the smallest v
(with v = 1, B would be only a few bits/computation reference), speed would
decrease accordingly. In a distributed multi-processor environment,
however, this may not be a serious problem, since we have considerably
reduced the communication requirements between the CPU and memory modules.
If it were a problem, higher values of v should be considered.

As with the SRAM, the reference stream was split into an instruction
reference stream and a data reference stream. Each substream was
tested with its own memory. The instruction stream did very well, with
faulting occurring rarely even with v=1. This was expected because of the
highly sequential nature of the instruction reference stream. For the
data stream, the results were not nearly as good. This was also expected,
since data is rarely fetched sequentially and shows much less structure
than the instruction stream. By using two separate memories we would
achieve some speed-up; i.e., the average number of interactions for all
references would be much smaller. However, we then would need another bit
to determine which memory we are accessing, thereby increasing the bits
sent per interaction by one. The net result would be a slightly faster
configuration than a single memory, with a slightly greater addressing
bandwidth than a single memory.


Table 13

SAM Analysis Results
B (bits/computation reference)

                                    v=1      v=2      v=3

LIST
  Combined Stream       BOTH       1.40     2.23     3.13
                        SEQ        1.75     2.40     3.19
                        TREE       1.75     2.55     3.13
  Instruction Stream    BOTH       1.04     2.01     3.00
  Data Stream           BOTH       2.00     2.57     3.33

GAUSS
  Combined Stream       BOTH       2.75     3.50     4.13
                        SEQ       19.39    14.04    10.52
                        TREE       3.23     4.00     4.63
  Instruction Stream    BOTH       1.01     2.00     3.00
  Data Stream           BOTH       3.88     4.13     4.33

[B for Original Trace, columns H0(S), H1(S), H2(S): only the entries
11.27 and 10.30 are legible.]
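
The relation between v and B in Table 13 can be illustrated with a short
calculation. The sketch below assumes, as the instruction stream rows
suggest (B stays close to v in every column), that B is v bits per
interaction times the average number of interactions per reference. The
function names are illustrative only and are not part of the original
analysis programs.

```python
def interactions_per_reference(b_bits, v):
    """Average interactions per reference, assuming B = v * interactions."""
    return b_bits / v


def split_memory_bits(b_bits, v):
    """B for one substream when one extra bit per interaction selects the memory."""
    return interactions_per_reference(b_bits, v) * (v + 1)


# LIST instruction stream, BOTH row of Table 13: B for v = 1, 2, 3.
list_instruction = {1: 1.04, 2: 2.01, 3: 3.00}
for v, b in list_instruction.items():
    print(v, round(interactions_per_reference(b, v), 3),
          round(split_memory_bits(b, v), 3))
```

Under this assumption the extra selector bit is proportionally most
expensive at v=1, where it doubles the bits sent per interaction.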

Thus the main objective of this chapter, to find a second order
memory structure with B close to H2(S), is realized with the SAM. The SAM
should therefore be considered seriously as a system component, and a
careful study of its other attributes in a system context is justified.

One  important  question  that  should  be  addressed  concerns  the 
time  and  space  trade-offs  between  a CPU  with  this  type  of  memory 
architecture  and  the  standard  360  architecture  with  random  access  memory. 
Table  14  gives  the  space  (in  bits)  and  time  (in  references)  requirements 
for  each  implementation  for  the  LIST  and  GAUSS  programs.  The  time 
requirements  were  calculated  directly  from  the  trace  statistics  collected 
by  our  analysis  programs.  The  bit  requirements  for  the  unfiltered 
programs  are  also  computed  from  program  statistics.  The  bit  requirements 
for  the  filtered  programs  running  on  a SAM  are  the  number  of  bits  required 
to  store  the  filtered  program  plus  the  bits  required  to  store  all  successor 
pointers. The number of pointers was calculated while H012 calculated X
and B for the SAM.

It should be noted that the successor pointers could be coded
much  more  efficiently  to  achieve  considerable  space  savings  over  those 
requirements  given  in  Table  14.  For  example,  by  using  default  values  and 
relative  displacement  instead  of  absolute  pointer  values,  the  total  bit 
requirements  could  be  reduced  significantly. 
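
The saving from default values and relative displacements can be sketched
as follows. The encoding here is hypothetical and is not the coding used
for Table 14: each pointer carries a one-bit flag, a successor at the next
location is treated as the default and costs only that flag, and any other
successor adds a fixed-width sign-magnitude displacement.

```python
def absolute_bits(pointers, addr_bits):
    """Storage when every successor pointer is a full absolute address."""
    return len(pointers) * addr_bits


def relative_bits(pointers):
    """Storage with a 1-bit default flag plus a signed displacement.

    pointers is a list of (location, successor) pairs. A successor equal to
    location + 1 is the default and costs only the flag bit.
    """
    displacements = [succ - loc for loc, succ in pointers if succ != loc + 1]
    if not displacements:
        return len(pointers)
    largest = max(abs(d) for d in displacements)
    width = largest.bit_length() + 1  # magnitude bits plus a sign bit
    return len(pointers) + len(displacements) * width


# Hypothetical successor pointers: two sequential, two branches.
example = [(100, 101), (101, 102), (102, 100), (103, 110)]
print(absolute_bits(example, 16), relative_bits(example))
```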

The  results  (for  LIST  and  GAUSS)  in  Table  14  are  quite  interesting. 
For  LIST,  there  is  roughly  a 65%  increase  in  space  requirements,  but  no 
increase in execution time. In fact, execution time is shortened slightly.
For GAUSS, however, there is a tremendous increase in space requirements
(roughly three times) and execution time (roughly doubled). This is the
worst case for all our traces, and appears to stem from the increased
overhead resulting from indexing into large arrays. It is our feeling
that if some simple indexing capabilities existed in the SAM, much of the
above overhead would disappear, perhaps approaching the requirements for
LIST.


The  indexing  problem  and  the  related  area  of  indirect  addressing 
are indeed nontrivial problems with the SAM. Solutions do, however, exist.
The indexing problem can best be described with the following example: if
two distinct portions of the instruction stream access the same array,
then each referencing instruction must not only work its way down the
balanced tree to the array element that is needed, but must also, upon
leaving the array, make a choice that leads it back to the proper place in
its respective instruction stream or "ramp." This problem is further
complicated by the fact that the particular array element accessed depends
on the value in the index register. One solution would be for the program
to give data to the CPU which tells the CPU the approximate size of the
array.  From  this  data  and  the  contents  of  the  index  register,  it  would 
not  be  difficult  for  the  CPU  to  automatically  generate  the  interactions 
required  to  fetch  the  desired  array  element.  Array  size  data  could  also 
be  stored  in  the  dummy  (empty  data)  fields  of  the  indirect  pointers.  This 
approach  could  also  apply,  with  some  modification,  to  indirect  addressing. 
However,  computed  addresses,  so  it  seems,  would  have  to  be  excluded  from 
systems  using  SAMs.  These  ideas  are  not  meant  to  be  taken  as  specific 
solutions  to  the  SAM  indexing  problems.  Rather  they  merely  demonstrate 
that  the  indexing  problem  need  not  be  a major  obstacle  in  the  use  of  this 
type  of  memory. 
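
As an illustration of how the CPU might generate these interactions
automatically, the sketch below assumes the array is reached through a
balanced tree with 2**v branches per node and that the index register
holds a zero-based element number. Both assumptions, and the function
name, are ours and are only one possible reading of the scheme.

```python
def array_access_interactions(index, array_size, v=1):
    """Selection values the CPU would send to walk a balanced tree to one element.

    Each interaction carries v bits, so each tree node has 2**v branches.
    """
    if not 0 <= index < array_size:
        raise ValueError("index out of range")
    fanout = 2 ** v
    depth = 0
    while fanout ** depth < array_size:
        depth += 1
    digits = []
    for _ in range(depth):
        digits.append(index % fanout)
        index //= fanout
    return digits[::-1]  # most significant selection first


print(array_access_interactions(5, 8))       # three 1-bit interactions
print(array_access_interactions(5, 8, v=2))  # two 2-bit interactions
```

Only the array size and the index are needed to produce the sequence,
which is why an approximate size supplied by the program would suffice.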

It is quite possible that the form that a second order memory
should take would use the ramp as a basic unit. The CPU would receive
ramps from the memory as the primary form of instruction communication.
The currently executing ramp within memory would then automatically set
up its own addressing structure, including indexing. The ramp's individual
index rate could be added automatically when a ramp is entered, allowing
immediate array access without CPU intervention. Since each ramp would
have its own index storage, accessing arrays common to many ramps would
be trivial. Also, since the ramp structure would carry knowledge of
previous executions of a ramp, enough information should be available to
lower B below H2(S), possibly even H(S)! Lowering B below H(S) would be
possible, since the memory suggested above would actually keep more
information on data referencing behavior than we allowed in the H(S) model
of Section 2.4. Also, a ramp structured memory would require considerably
less pointer storage, since only pointers to successor ramps would be
necessary. Of course, other addressing information would be required for
data accessing within the ramp, but the tremendous pointer storage
requirements of the SAM would simply not exist.
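
The pointer saving can be made concrete with a small count. The sketch
below assumes, for illustration only, that a ramp is a maximal run of
consecutive instruction addresses in a non-empty trace; the definition
used elsewhere in this work may differ in detail.

```python
def split_into_ramps(trace):
    """Split an instruction address trace into maximal runs of consecutive addresses."""
    ramps = []
    start = prev = trace[0]
    for addr in trace[1:]:
        if addr != prev + 1:
            ramps.append((start, prev))
            start = addr
        prev = addr
    ramps.append((start, prev))
    return ramps


def pointer_counts(trace):
    """Return (word-level successor pointers, ramp-level successor pointers)."""
    word_pointers = {(a, b) for a, b in zip(trace, trace[1:])}
    ramps = split_into_ramps(trace)
    ramp_pointers = {(a, b) for a, b in zip(ramps, ramps[1:])}
    return len(word_pointers), len(ramp_pointers)


# Hypothetical trace: a four-word loop executed twice, then a jump.
print(pointer_counts([10, 11, 12, 13, 10, 11, 12, 13, 20, 21]))
```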

In  fact,  with  the  ramp  memory,  many  accesses  may  be  effected 
with  only  one  interaction.  This  fact  may  make  it  attractive  to  share  the 
memory  among  several  processes.  Such  sharing  would  allow  the  memory  unit 
to  be  used  in  a multiprocessing  environment. 

In  concluding  this  section,  it  must  be  noted  that  not  only  are 
SAM-like  memories  attractive  for  general  purpose  environments,  but  they 
could  also  be  useful  in  special  purpose  environments.  Already  the 
Fairchild  F8  microprocessor  has  a limited  form  of  on-chip  addressing 
incorporated into the memory chips [26]. This consists of an
auto-incrementing register which is generally used as a program counter.

Essentially,  with  the  SAM  we  have  traded  off  storage  space  for 
decreased bandwidth, i.e., we can increase the functional capabilities of
the  chip  and  reduce  its  address  pin  requirements  to  one  pin.  For  the 
LSI  system  designer,  this  trade-off  would  appear  to  be  quite  inviting. 

4.6.  Summary and Conclusions

In  Chapter  3 we  showed  that  extremely  efficient  addressing  cannot 
be  achieved  unless  at  least  second  order  reference  stream  information  is 
used. In this chapter we have examined higher order memory architectures
which attempt to decrease addressing overhead by taking advantage of this
second order structure. Several examples of second order memories were
presented, two of which, the Successor Accessed Memory (SAM) and the
Segmented Random Access Memory (SRAM), were analyzed and their operation
evaluated  by  simulation. 

The  main  conclusion  we  can  derive  from  the  results  is  that  more 
efficient  addressing  can  indeed  be  achieved  by  using  a memory  system  which 
is  designed  to  exploit  second  order  reference  stream  behavior.  The  SRAM, 
which  is  easily  implementable  and  uses  second  order  information  to  a 
limited  extent,  achieves  a significant  improvement  over  the  current  state 
of the art. Furthermore, the SAM, which uses more reference stream
information, performs many times better than the SRAM, but requires us to
extract the required accessing information during compile time and store
it  within  the  memory. 
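
The idea of extracting successor information ahead of time and then
addressing by selection can be sketched as a toy model. This is not the
simulator used for the results above. It assumes that every reference
costs at least one interaction of v bits, and that a word with more than
2**v recorded successors is resolved by a balanced selection tree.

```python
from collections import defaultdict


def sam_bits_per_reference(trace, v=1):
    """Estimate B for a toy successor-accessed memory on one address trace."""
    # "Compile time" pass: record every successor seen for each address.
    successors = defaultdict(set)
    for cur, nxt in zip(trace, trace[1:]):
        successors[cur].add(nxt)

    fanout = 2 ** v
    total_interactions = 0
    for cur in trace[:-1]:
        choices = len(successors[cur])
        # Walk a balanced selection tree: at least one interaction per reference.
        steps, reach = 1, fanout
        while reach < choices:
            steps += 1
            reach *= fanout
        total_interactions += steps
    return v * total_interactions / (len(trace) - 1)


# Hypothetical trace in which address 0 has three different successors.
trace = [0, 1, 0, 2, 0, 3, 0, 1]
print(sam_bits_per_reference(trace, v=1), sam_bits_per_reference(trace, v=2))
```

In this toy model a larger v removes the multi-step selections but raises
the minimum cost of every reference to v bits.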

The  main  result  of  this  chapter  is  that  through  increased 
memory complexity, addressing bandwidth and hence addressing overhead
can be reduced significantly (more than an order of magnitude for our
traces). It is our belief that this type of memory does have a place
in  the  distributed  intelligence,  LSI  computer  systems  of  the  future.  In 
a sense,  the  major  portion  of  the  addressing  function  can  now  be  placed 
within the memory where it belongs and the oft casually referred to
"intelligent memory" may become a reality.

CHAPTER 5
CONCLUSION

5.1.  Summary and Conclusions

In  Chapter  2,  the  addressing  and  computation  processes  within 
an  executing  CPU  were  defined.  From  this  followed  the  definitions  of 
the addressing overhead, A, and the information content of the computation
process memory reference stream, H(S). It was shown that A ≥ H(S), making A
vs. H(S) a good measure of addressing process efficiency. Techniques were
then developed for obtaining Â (estimated A) and Ĥ(S) (estimated H(S)) from
actual program execution on an IBM 360. A system of programs was presented
which calculated A, H(S), as well as the low order entropies H0(S), H1(S),
and H2(S). The results of these calculations were then examined. They
showed that Â was considerably larger (usually an order of magnitude larger)
than H(S) for all traces. A was, in fact, even larger than H0(S) and H1(S),
demonstrating considerable coding inefficiency. Perhaps the most interesting
result concerned the second order entropy H2(S), which was very close to
H(S) for most programs and much smaller than H1(S) for all programs. This
indicates that second order knowledge of a program's behavior fairly
completely identifies the program's structure and is therefore quite
useful. In concluding Chapter 2 it was demonstrated that for one typical
program, addressing overhead accounted for some 58.5% of the total bit
stream  entering  the  CPU. 
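
For readers without Chapter 2 at hand, the low order entropies can be
illustrated on a small trace. The definitions below are assumptions made
for this sketch and may differ in detail from those of Chapter 2: H0 is
the log of the number of distinct addresses, H1 is the entropy of the
address frequencies, and H2 is the entropy of the next address given the
current one. The trace must contain at least two references.

```python
import math
from collections import Counter


def low_order_entropies(trace):
    """Return (H0, H1, H2) in bits per reference for an address trace."""
    n = len(trace)
    counts = Counter(trace)
    h0 = math.log2(len(counts))
    h1 = -sum(c / n * math.log2(c / n) for c in counts.values())

    # Second order: uncertainty of the next address given the current one.
    pairs = Counter(zip(trace, trace[1:]))
    firsts = Counter(trace[:-1])
    m = n - 1
    h2 = -sum(c / m * math.log2(c / firsts[a]) for (a, _), c in pairs.items())
    return h0, h1, h2


# Hypothetical trace: a four-address loop executed four times.
print(low_order_entropies([0, 1, 2, 3] * 4))
```

For a purely repetitive loop such as this one, H0 and H1 are both two bits
while H2 is zero, which is the kind of gap between first and second order
entropy described above.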

Chapter 3 discussed some methods for reducing A. A model of the
address register load process for an R+D architecture was developed which
demonstrated that addressing could be done more efficiently (for our
traces) if the number of address registers and/or the displacement were
reduced in the IBM 360 architecture. It was shown that R+D addressing
is reasonably efficient, but sensitive to proper address register
allocation. The problem of such allocation was examined briefly and was
shown to be difficult to solve well. In concluding the chapter, the
possible improvements to GAUSS's addressing overhead when using
the techniques presented in the chapter were analyzed. It was discovered
that for GAUSS, A could be reduced by approximately 59% (to about H0(S))
by a variety of techniques. It was further concluded that without major
architectural changes this improvement, while significant, is about the
best  that  one  could  hope  for. 
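
The trade-off between field widths and register loads can be sketched
with a toy model. It is not the Chapter 3 model: it assumes least recently
used replacement of base registers, a register field of log2(R) bits, and
a full 24-bit address (the IBM 360 address width) for every register load.

```python
def count_register_loads(trace, num_regs, disp_bits):
    """Count base-register loads for a toy R+D model with LRU replacement."""
    span = 2 ** disp_bits
    bases = []  # least recently used base first
    loads = 0
    for addr in trace:
        hit = next((b for b in bases if b <= addr < b + span), None)
        if hit is None:
            loads += 1
            if len(bases) == num_regs:
                bases.pop(0)
            bases.append(addr)
        else:
            bases.remove(hit)
            bases.append(hit)
    return loads


def addressing_bits(trace, num_regs, disp_bits, addr_bits=24):
    """Total addressing bits: R and D fields per reference plus full-address loads."""
    reg_field = max(num_regs - 1, 0).bit_length()
    loads = count_register_loads(trace, num_regs, disp_bits)
    return len(trace) * (reg_field + disp_bits) + loads * addr_bits


# Hypothetical trace touching three small regions of memory.
trace = [0, 1, 2, 100, 101, 0, 1, 200, 100]
print(addressing_bits(trace, 2, 4), addressing_bits(trace, 16, 12))
```

On this small trace the narrow fields cost more register loads but fewer
bits overall; the balance depends entirely on the trace and on how well
registers are allocated.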

Chapter  4 then  looked  at  some  more  radical  architectural  changes. 
In  particular  the  concept  of  second  order  memory  was  introduced.  This 
memory architecture takes advantage of the second order behavior discussed
in Chapter 2 to reduce addressing overhead. Two examples were analyzed
in detail. One was a moderate approach, the Segmented Random Access
Memory (SRAM), which is not too difficult to use and approaches H1(S)
in performance. The second example, the Successor Accessed Memory (SAM),
is a true second order memory which approaches H2(S) in performance.
This memory is unfortunately more difficult to use, but does represent
a considerable  further  gain  in  addressing  efficiency. 

The  main  contributions  made  by  this  research  include: 
i)  Addressing architecture can be more closely matched to program
behavior, reducing much unnecessary overhead.

ii)  Memories can be built which, by using knowledge of program behavior,
require only a small fraction of the CPU/Memory bandwidth currently
being used.

iii)  More concise models of a memory reference stream are available, which
incorporate knowledge of higher order reference stream behavior.

iv)  Program size, through increasing address efficiency, can be reduced.
This  reduction  could  be  even  more  significant  when  coupled  with 
Hehner's  work  [2]  on  opcode  and  data  size  optimization. 

In conclusion, to obtain a very large improvement in CPU/Memory
bandwidth and CPU addressing efficiency, higher order memories and
accompanying radical changes in CPU architecture and system software are
necessary. The potential payoff looms ever larger as computer memory and
addresses continue to grow in size, LSI chips become larger, and
interconnection costs increase relative to logic costs.

5.2.  Topics for Further Research

Though  some  have  been  briefly  discussed  previously,  this  section 
presents  some  areas  for  further  research. 

i)  Application  of  the  information  theoretic  approach  to  the  computation 
process.  Such  research  could  examine  the  efficiency  of  the 
communication  link  between  the  computation  process  and  memory. 
Probabilistic  characterization  of  the  computation  process  would  be 
required.  This  approach  then  would  eventually  lead  to  a reduction 
of the graph in both size of program and execution time.

ii)  Improved allocation algorithms for program compilation. This
would include not only heuristics to solve the problem as discussed
in Chapter 3, but also work in the area of adaptive restructuring of
compiled programs for improved addressing.

iii)  Application  of  the  techniques  presented  in  this  dissertation  to  the 
analysis of multi-level memory hierarchies and paging environments.

iv)  Analysis of the performance improvements possible in special purpose
computing environments, where programs are compiled once and
executed many times. This would also include applications to
microprogrammed control structures within general purpose machines.

v)  Improvement of second order memories by investigating, among other
things, improved pointer coding and accessing, and "ramp" accessed
memories in an attempt to reduce the large amount of information needed
by the Successor Accessed Memory (SAM) and decrease program execution
time. Adaptive or predictive techniques may exist for certain
classes of program behavior. For example, indexing mechanisms could
be included in the Successor Accessed Memory to avoid the large
access  trees  required  for  arrays.  This  is  a worthy  goal  since  array 
accessing  contributed  the  most  significant  term  in  our  calculations 
of  the  overhead  function  of  the  SAM. 

vi)  Further refinement of the estimation of H(S) to capture more
information in the model system itself. This could include checking the
invariance of H(S) for the same algorithm and data on different
architectures, variation of program behavior across different input
data, and indexing information as derived in v) above.

vii)  Research on the performance potential of an architecture with
multiple address and multiple computation processors operating in a
distributed "load-sharing" environment.

viii)  Adaptation of instruction sets and compilers to take best advantage
of a computer system that uses "intelligent" memory.

ix)  Extending this research to evaluate architectures with a separate
memory addressing processor rather than including this processor
with the memory (as in Chapter 4) or with the computation processor
(as in Chapter 3 and conventionally).